Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs

The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions.
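
The two ideas in the abstract, filtering documents by a site's crawling opt-out and comparing compliant against non-compliant models, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not give the exact DCG formula, so the relative difference in percent used below is only one plausible reading. The helper names `allows_crawler` and `data_compliance_gap`, the user agent `ExampleAIBot`, and all scores are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allows_crawler(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def data_compliance_gap(noncompliant_score: float, compliant_score: float) -> float:
    """Relative gap in percent (assumed formula, not taken from the paper).

    Positive values mean the model trained on opt-out-compliant data
    scores lower than the model trained on the unfiltered data.
    """
    if noncompliant_score == 0:
        raise ValueError("noncompliant_score must be non-zero")
    return 100.0 * (noncompliant_score - compliant_score) / noncompliant_score


# Hypothetical robots.txt in which a publisher opts out of one AI crawler
# while still allowing every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    url = "https://example.com/article"
    print(allows_crawler(ROBOTS_TXT, "ExampleAIBot", url))  # opted-out crawler
    print(allows_crawler(ROBOTS_TXT, "OtherBot", url))      # any other crawler
    # Made-up benchmark accuracies, for illustration only.
    print(data_compliance_gap(noncompliant_score=50.0, compliant_score=49.0))
```

Under this reading, a DCG close to 0% means that dropping opted-out documents costs almost nothing on the benchmark, while a clearly positive DCG would correspond to the kind of decline the abstract reports for specialized domains.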
