Assessing the impact of training data on machine learning models is crucial for understanding model behavior, enhancing transparency, and selecting training data. Influence functions provide a theoretical framework for quantifying the effect of individual training data points on a model's performance on a specific test data point. However, the computational and memory costs of influence functions present significant challenges, especially for large-scale models and even when approximation methods are used, because the gradients involved in the computation are as large as the model itself. In this work, we introduce a novel approach that leverages dropout as a gradient compression mechanism to compute influence functions more efficiently. Our method significantly reduces computational and memory overhead, not only during the influence function computation but also in the gradient compression process itself. Through theoretical analysis and empirical validation, we demonstrate that our method preserves critical components of data influence and enables its application to modern large-scale models.
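To make the idea concrete, the following is a minimal sketch of an influence score computed on dropout-compressed gradients. It is an illustration under stated assumptions, not the paper's actual algorithm: the function names (`dropout_mask`, `compress`, `influence_scores`), the use of a single keep-mask shared by all examples, and the approximation of the Hessian by the damped second moment of the compressed training gradients are all hypothetical choices made for this example.

```python
import numpy as np

def dropout_mask(dim, keep_prob, rng):
    """Sample one Bernoulli keep-mask; return the indices of kept coordinates."""
    return np.flatnonzero(rng.random(dim) < keep_prob)

def compress(grads, kept):
    """Keep only the selected coordinates (the same mask for every gradient)."""
    return grads[..., kept]

def influence_scores(train_grads, test_grad, kept, damping=1e-3):
    """Score each training point as -g_test^T (H + damping*I)^{-1} g_train.

    Assumption: H is approximated by the empirical second moment of the
    compressed training gradients, so the linear solve is k x k, not d x d.
    """
    g_train = compress(train_grads, kept)   # shape (n, k)
    g_test = compress(test_grad, kept)      # shape (k,)
    k = g_train.shape[1]
    hessian = g_train.T @ g_train / len(g_train) + damping * np.eye(k)
    return -g_train @ np.linalg.solve(hessian, g_test)
```

With a keep probability of 0.1, each stored gradient and the linear system shrink to roughly a tenth of the original dimension, and selecting coordinates by index requires no projection matrix. How much of the true influence survives this compression is the question the paper's analysis addresses; this sketch makes no claim about it.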