Traditional data influence estimation methods, such as influence functions, assume that learning algorithms are permutation-invariant with respect to the training data. Modern training paradigms violate this assumption: foundation models in particular are trained with stochastic algorithms and multi-stage curricula, both of which are sensitive to data ordering. This mismatch leaves influence functions unable to answer a critical question in machine learning: how does the influence of a data point depend on the optimization trajectory followed during training? To address this gap, we formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point from a specific training iteration while accounting for the exact sequence of data encountered and the model's optimization trajectory. Evaluating the trajectory-specific LOO influence exactly, however, presents a significant computational challenge. We therefore propose data value embedding, a novel technique that enables efficient approximation of the trajectory-specific LOO influence. Specifically, we compute a training data embedding that encapsulates the cumulative interactions between the data and the evolving model parameters. The trajectory-specific LOO influence can then be efficiently approximated by a simple dot product between the data value embedding and the gradient of the given test data. Because data value embedding captures the ordering of the training data, it offers valuable insights into model training dynamics. In particular, we uncover distinct phases of data influence, revealing that data points in the early and late stages of training exert a greater impact on the final model. These insights translate into actionable strategies for managing the computational overhead of data selection by strategically timing the selection process, potentially opening new avenues in data curation research.
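The idea of a trajectory-specific LOO influence approximated by a dot product can be illustrated on a toy problem. The sketch below is not the authors' algorithm; it assumes a linear model with squared loss, single-example SGD with a fixed learning rate, and a naive forward propagation of the removed update through the later steps. All names (`sgd`, `embedding`, `skip`, and so on) are hypothetical and chosen only for illustration.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def axpy(alpha, x, y):
    """Return alpha * x + y for plain Python lists."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def grad(theta, x, y):
    """Gradient of the squared loss 0.5 * (x . theta - y)^2 w.r.t. theta."""
    r = dot(x, theta) - y
    return [r * xi for xi in x]

def loss(theta, x, y):
    return 0.5 * (dot(x, theta) - y) ** 2

def sgd(data, lr, dim, skip=None):
    """Single-example SGD over `data` in the given order; returns all iterates.
    If `skip` is an iteration index, that step's update is dropped
    (assumption: this is how 'removing a point from an iteration' is modelled)."""
    theta = [0.0] * dim
    iterates = [theta]
    for t, (x, y) in enumerate(data):
        if t != skip:
            theta = axpy(-lr, grad(theta, x, y), theta)
        iterates.append(theta)
    return iterates

def embedding(data, iterates, lr, s):
    """Hypothetical embedding of the point used at step s: the dropped update
    lr * grad(theta_s, z_s), propagated through every later SGD step."""
    x_s, y_s = data[s]
    v = [lr * g for g in grad(iterates[s], x_s, y_s)]
    for x_k, _ in data[s + 1:]:
        # Apply (I - lr * x_k x_k^T): the squared-loss Hessian is x_k x_k^T.
        v = axpy(-lr * dot(x_k, v), x_k, v)
    return v

# Synthetic data (assumption: Gaussian features, noisy linear targets).
random.seed(0)
dim, steps, lr = 3, 40, 0.05
true_theta = [1.0, -2.0, 0.5]
data = []
for _ in range(steps):
    x = [random.gauss(0, 1) for _ in range(dim)]
    data.append((x, dot(x, true_theta) + random.gauss(0, 0.1)))
x_test = [random.gauss(0, 1) for _ in range(dim)]
test_point = (x_test, dot(x_test, true_theta))

s = 5                                   # iteration whose data point is removed
full = sgd(data, lr, dim)               # original trajectory
retrained = sgd(data, lr, dim, skip=s)  # trajectory with step s removed

# Exact trajectory-specific LOO on the test loss (requires retraining).
exact_loo = loss(retrained[-1], *test_point) - loss(full[-1], *test_point)

# Approximation: dot product of the embedding with the test gradient.
emb = embedding(data, full, lr, s)
approx_loo = dot(grad(full[-1], *test_point), emb)

print(f"exact LOO  : {exact_loo:+.6f}")
print(f"approx LOO : {approx_loo:+.6f}")
```

Because the loss in this toy setting is quadratic, the propagated vector reproduces the parameter shift caused by the removal exactly, and the dot product differs from the exact LOO only by a second-order term in that shift. For non-quadratic losses the propagation itself would be a first-order approximation, and changing `s` shows how the same kind of update contributes differently depending on when it occurs in training.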