General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. Counting Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.
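The AI-specific robots.txt clauses described above can be illustrated with a minimal sketch, assuming a hypothetical robots.txt and example crawler names (GPTBot, Googlebot); it uses Python's standard-library robots.txt parser to show how a site can disallow an AI crawler while leaving general-purpose crawlers unrestricted:

```python
# Minimal sketch: check whether a robots.txt restricts a given crawler.
# The robots.txt content and user-agent names below are illustrative,
# not drawn from the audited domains.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if `user_agent` may fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# An AI-specific clause blocks GPTBot while general crawlers remain allowed.
print(is_allowed(ROBOTS_TXT, "GPTBot"))     # False
print(is_allowed(ROBOTS_TXT, "Googlebot"))  # True
```

Note that robots.txt is an advisory protocol: the parser reports the site's expressed preference, but nothing technically prevents a non-compliant crawler from fetching the page anyway, which is one reason the paper contrasts these signals with Terms of Service restrictions.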