General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. Counting Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.
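The AI-specific robots.txt clauses described above can be illustrated with a minimal sketch, assuming a hypothetical robots.txt and example crawler names (GPTBot, Googlebot); it uses Python's standard-library robots.txt parser to show how a site can disallow an AI crawler while leaving general-purpose crawlers unrestricted:

```python
# Minimal sketch: check whether a robots.txt restricts a given crawler.
# The robots.txt content and user-agent names below are illustrative,
# not drawn from the audited domains.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if `user_agent` may fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# An AI-specific clause blocks GPTBot while general crawlers remain allowed.
print(is_allowed(ROBOTS_TXT, "GPTBot"))     # False
print(is_allowed(ROBOTS_TXT, "Googlebot"))  # True
```

Note that robots.txt is an advisory protocol: the parser reports the site's expressed preference, but nothing technically prevents a non-compliant crawler from fetching the page anyway, which is one reason the paper contrasts these signals with Terms of Service restrictions.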