Through the Looking-Glass: Tracing Shifts in AI Data Consent across the Web

Abstract
Modern, general-purpose artificial intelligence (AI) systems are largely built on massive swathes of public web data, which have been assembled into datasets such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale audit of the consent and provenance information of the web domains underlying general-purpose AI training corpora. Our audit of 14,000 web domains provides an expansive view of the nature of crawlable content available on the web. We conduct a temporal analysis of how content access has changed over time, and we show that since 2016 there has been a rapid crescendo of restrictive policies from web sources, a proliferation of new AI-specific clauses to limit use, and acute differences between how restrictions apply across AI organization crawlers. We note contradictions, inconsistencies and asymmetries between the intentions websites express in their terms of service and in their instructions to web crawlers. Our longitudinal analyses allow us to forecast the extent to which the development of responsible, general-purpose AI will be hindered by shifting governance on the web---by mid 2025, we forecast that 22\% of the data available in a Common Crawl dump from 2019 will be restricted for use in model training by robots.txt or terms of service, foreclosing much of the highest quality training data from the web.