The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained on such corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B-parameter models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions. Our website is available at https://data-compliance.github.io/.
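As a minimal formalization sketch (our notation; the paper's exact definition may differ), let $P_{\text{non-compliant}}$ and $P_{\text{compliant}}$ denote benchmark scores of models trained without and with opt-out filtering, respectively. The gap can then be expressed as a relative performance difference,
$$\mathrm{DCG} = \frac{P_{\text{non-compliant}} - P_{\text{compliant}}}{P_{\text{non-compliant}}} \times 100\%,$$
so a value near $0\%$ indicates that compliance does not hurt performance on the evaluated benchmark.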