Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs

The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions. Our website is available at https://data-compliance.github.io/.
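As a rough illustration of the two ideas in the abstract, the sketch below checks whether a page is opted out via `robots.txt` and computes a gap between two benchmark scores. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not give a formula for the DCG, so the relative-difference definition here is an assumption, and the function names, the example crawler name, and the scores are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def data_compliance_gap(noncompliant_score: float, compliant_score: float) -> float:
    """Relative performance gap in percent (an assumed definition).

    A positive value means the model trained on opt-out-compliant data
    scores lower than the model trained on unfiltered data.
    """
    if noncompliant_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (noncompliant_score - compliant_score) / noncompliant_score


# Hypothetical robots.txt that opts one crawler out and allows all others.
ROBOTS_TXT = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/article"))
    # Hypothetical benchmark accuracies, not results from the paper.
    print(data_compliance_gap(40.0, 38.0))
```

Under this assumed definition, a gap near 0% corresponds to the general-knowledge finding, while a clearly positive gap corresponds to the decline reported for specialized domains.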

