Taming Hyperparameter Sensitivity in Data Attribution: Practical Selection Without Costly Retraining

Data attribution methods, which quantify the influence of individual training data points on a machine learning model, have become increasingly popular in data-centric applications of modern AI. Despite a recent surge of new methods in this space, the impact of hyperparameter tuning on these methods remains under-explored. In this work, we present the first large-scale empirical study of the hyperparameter sensitivity of common data attribution methods. Our results show that most methods are indeed sensitive to certain key hyperparameters. However, whereas the hyperparameters of typical machine learning algorithms can be tuned using computationally cheap validation metrics, evaluating data attribution performance often requires retraining models on subsets of the training data, which makes such evaluation metrics prohibitively costly for hyperparameter tuning. This poses a critical open challenge for the practical application of data attribution methods. To address this challenge, we advocate for a better theoretical understanding of hyperparameter behavior to inform efficient tuning strategies. As a case study, we provide a theoretical analysis of the regularization term that is critical in many variants of influence function methods. Building on this analysis, we propose a lightweight procedure for selecting the regularization value without model retraining, and validate its effectiveness across a range of standard data attribution benchmarks. Overall, our study identifies a fundamental yet overlooked challenge in the practical application of data attribution, and highlights the importance of careful discussion of hyperparameter selection in future method development.
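To make the role of the regularization term concrete, the following is a minimal sketch of a damped influence function for a small logistic regression model, where the Hessian H is replaced by H + λI before inversion. It is an illustration under stated assumptions, not the selection procedure proposed in this work: the synthetic data, the helper names (`per_example_grads`, `influence_scores`, `rank_correlation`), and the grid of λ values are all hypothetical. The sketch only shows how one might check whether the ranking of training points by influence changes as λ varies.

```python
import numpy as np


def per_example_grads(X, y, theta):
    """Per-example gradients of the logistic log loss, shape (n, d)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return (p - y)[:, None] * X


def influence_scores(X_train, y_train, x_test, y_test, theta, lam):
    """Influence of each training point on the test loss, using (H + lam * I)^{-1}.

    Follows the common convention score_i = -g_test^T (H + lam * I)^{-1} g_i,
    where H is the Hessian of the mean training loss at theta.
    """
    p = 1.0 / (1.0 + np.exp(-X_train @ theta))
    H = (X_train * (p * (1.0 - p))[:, None]).T @ X_train / len(X_train)
    g_train = per_example_grads(X_train, y_train, theta)
    g_test = per_example_grads(x_test[None, :], np.array([y_test]), theta)[0]
    # Solve the damped linear system instead of forming an explicit inverse.
    v = np.linalg.solve(H + lam * np.eye(H.shape[0]), g_test)
    return -g_train @ v


def rank_correlation(a, b):
    """Spearman-style rank correlation (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]


# Hypothetical synthetic setup: badly scaled features give an ill-conditioned Hessian.
rng = np.random.default_rng(0)
n, d = 200, 10
scales = np.logspace(0, -3, d)
X = rng.normal(size=(n, d)) * scales
y = (X @ rng.normal(size=d) + 0.1 * rng.normal(size=n) > 0).astype(float)

# Fit theta with plain gradient descent (an approximation to the optimum).
theta = np.zeros(d)
for _ in range(500):
    theta -= 0.5 * per_example_grads(X, y, theta).mean(axis=0)

x_test, y_test = rng.normal(size=d) * scales, 1.0

# Compare influence rankings across an assumed grid of regularization values.
lams = [1e-6, 1e-4, 1e-2, 1.0]
scores = {lam: influence_scores(X, y, x_test, y_test, theta, lam) for lam in lams}
for lam in lams:
    corr = rank_correlation(scores[lams[0]], scores[lam])
    print(f"lam={lam:g}  rank correlation vs. lam={lams[0]:g}: {corr:.3f}")
```

A rank correlation well below 1 between two values of λ would indicate that the choice of regularization changes which training points are judged most influential. Deciding which ranking is better is the step that ordinarily requires retraining-based evaluation.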
