Data Shapley provides a principled framework for attributing the contribution of individual data points in machine learning. However, existing approaches require re-training models on different data subsets, which is computationally intensive and forecloses their application to large-scale models. Furthermore, they produce the same attribution score for all models produced by running the learning algorithm, meaning they cannot perform targeted attribution toward a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into the contribution of pretraining data and discuss their implications for copyright in generative AI and for pretraining data curation.