Data valuation quantifies the value of training data and is used for data attribution (i.e., determining the contribution of training data to model predictions) and data selection, both of which are important for curating high-quality datasets to train large language models. In our paper, we show that data valuation through in-context probing (i.e., prompting an LLM) approximates influence functions for selecting training data. We provide a theoretical sketch of this connection based on transformer models performing "implicit" gradient descent on their in-context inputs. Our empirical findings show that in-context probing and gradient-based influence frameworks produce similar rankings of training data. Furthermore, fine-tuning experiments on data selected by either method yield similar model performance.
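To make the gradient-based side of this comparison concrete, here is a minimal sketch of first-order influence scoring on a toy linear model. It drops the inverse-Hessian term of the classical influence function and scores each training example by the dot product between its gradient and the mean validation gradient (a TracIn-style approximation). All data, names, and the model here are hypothetical illustrations, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: targets follow y = 2x plus noise.
X_train = rng.normal(size=(20, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.1, size=20)
X_val = rng.normal(size=(5, 1))
y_val = 2.0 * X_val[:, 0]

# Parameters of a "partially trained" model, so gradients are informative.
w = np.array([1.0])

def grad(x, y, w):
    """Gradient of the squared loss 0.5 * (w.x - y)^2 with respect to w."""
    return (x @ w - y) * x

# First-order influence proxy: a gradient step on training example z changes
# the validation loss by roughly -lr * grad(z) . grad_val, so a large positive
# dot product means the example is helpful for the validation set.
g_val = np.mean([grad(x, y, w) for x, y in zip(X_val, y_val)], axis=0)
scores = np.array([grad(x, y, w) @ g_val for x, y in zip(X_train, y_train)])

# Rank training data from most to least valuable under this proxy.
ranking = np.argsort(-scores)
print(ranking[:5])  # indices of the five highest-scoring training points
```

In the paper's framing, an in-context probing score for each training example (obtained by prompting an LLM) would be compared against a ranking like `ranking` above, e.g. via rank correlation, to test how closely the two valuation methods agree.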