Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled personnel. While some training data is publicly available, substantial investment is required to generate proprietary datasets, such as human preference annotations, or to curate new datasets from existing sources. As larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and investment in data sources? Second, how can multiple data owners collaboratively pool their resources to train superior models while fairly distributing the benefits? This problem, known as data valuation, is not specific to large language models; the machine learning community has addressed it through the lens of cooperative game theory, with the Shapley value as the prevalent solution concept. However, computing Shapley values for data valuation is notoriously expensive, typically requiring numerous model retrainings, which can become prohibitive for large machine learning models. In this work, we demonstrate that this computational challenge is dramatically simplified for LLMs trained with Direct Preference Optimization (DPO). We show how the specific mathematical structure of DPO enables scalable Shapley value computation. We believe this observation unlocks many applications at the intersection of data valuation and large language models.
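For reference, the two objects at play can be stated explicitly. The following is a minimal sketch of the standard definitions, not this work's specific construction: the Shapley value of a player (here, a data point or data owner) $i$ in a set $N$ under a utility function $v$, where $v(S)$ is assumed for illustration to be the performance of a model trained on the subset $S$, and the standard DPO objective over a preference dataset $\mathcal{D}$ of prompts $x$ with chosen and rejected responses $y_w, y_l$:

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].$$

The sum in the first expression ranges over all $2^{|N|-1}$ coalitions, so a naive evaluation requires one model retraining per coalition; this exponential retraining cost is the computational challenge that the mathematical structure of DPO is shown to circumvent.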