The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained on these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs and models trained on datasets that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B-parameter models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions.
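As a point of reference, one plausible formalization of the DCG is sketched below; the symbols $\mathrm{perf}(\cdot)$, $M_{\text{compliant}}$, and $M_{\text{non-compliant}}$ are illustrative notation rather than the paper's own, and the gap is assumed to be a relative difference since it is reported as a percentage:
\[
\mathrm{DCG} \;=\; \frac{\mathrm{perf}\!\left(M_{\text{non-compliant}}\right) - \mathrm{perf}\!\left(M_{\text{compliant}}\right)}{\mathrm{perf}\!\left(M_{\text{non-compliant}}\right)} \times 100\%,
\]
where $\mathrm{perf}(\cdot)$ denotes benchmark performance and $M_{\text{compliant}}$, $M_{\text{non-compliant}}$ are models trained with and without honoring web crawling opt-outs. Under this reading, a DCG close to 0\% means the compliant model matches its non-compliant counterpart.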