How Do Data Owners Say No? A Case Study of Data Consent Mechanisms in Web-Scraped Vision-Language AI Training Datasets

The internet has become the main source of data for training modern text-to-image and vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners' wishes. Ignoring an owner's indication of consent around data usage not only raises ethical concerns but has also recently led to copyright infringement lawsuits. In this work, we aim to reveal information about data owners' consent to AI scraping and training, and study how it is expressed in DataComp, a popular dataset whose CommonPool contains 12.8 billion text-image pairs. We examine both sample-level information, including copyright notices, watermarks, and metadata, and web-domain-level information, such as a site's Terms of Service (ToS) and Robots Exclusion Protocol. We estimate that at least 122M samples in CommonPool exhibit some indication of a copyright notice, and find that 60\% of the samples in the top 50 domains come from websites whose ToS prohibit scraping. Furthermore, we estimate that 9-13\% (95\% confidence interval) of the samples in CommonPool contain watermarks, which existing watermark detection methods fail to capture with high fidelity. Our holistic methods and findings show that data owners rely on various channels to convey data consent, which current AI data collection pipelines do not entirely respect. These findings highlight the limitations of current dataset curation and release practices and the need for a unified data consent framework that takes AI purposes into consideration.
