Through the Looking-Glass: Tracing Shifts in AI Data Consent across the Web

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Shamiso Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kushagra Tiwary, Lester James V. Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Minh Chien Vu, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Alex Pentland

Poster

Abstract

Modern, general-purpose artificial intelligence (AI) systems are largely built on massive swathes of public web data, which have been assembled into datasets such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale audit of the consent and provenance information of the web domains underlying general-purpose AI training corpora. Our audit of 14,000 web domains provides an expansive view of the nature of crawlable content available on the web. We conduct a temporal analysis of how content access has changed over time, and we show that since 2016 there has been a rapid crescendo of restrictive policies from web sources, a proliferation of new AI-specific clauses to limit use, and acute differences between how restrictions apply across AI organization crawlers. We note contradictions, inconsistencies and asymmetries between the intentions websites express in their terms of service and in their instructions to web crawlers. Our longitudinal analyses allow us to forecast the extent to which the development of responsible, general-purpose AI will be hindered by shifting governance on the web---by mid 2025, we forecast that 22\% of the data available in a Common Crawl dump from 2019 will be restricted for use in model training by robots.txt or terms of service, foreclosing much of the highest quality training data from the web.