We investigate the contents of web-scraped data for training AI systems, at scales where human dataset curators and compilers no longer manually annotate every sample. Building on prior privacy concerns about machine learning models, we ask: What are the legal privacy implications of web-scraped machine learning datasets? In an empirical study of a popular training dataset, we find a significant presence of personally identifiable information despite sanitization efforts. Our audit provides concrete evidence supporting the concern that any large-scale web-scraped dataset may contain legally defined personal data. We use these findings from a real-world dataset to inform our legal analysis with respect to existing privacy and data protection laws. We surface various legal risks in current data curation practices, which may propagate personal information into the training of downstream models. Based on our empirical and legal analyses, we argue for a reorientation of current frameworks of "publicly available" information to meaningfully limit the development of AI built upon indiscriminate scraping of the internet.