Influence functions (IF) have been seen as a technique for explaining model predictions through the lens of the training data. Their utility is assumed to lie in identifying the training examples "responsible" for a prediction, so that, for example, a prediction can be corrected by intervening on those examples (removing or editing them) and retraining the model. However, recent empirical studies have shown that existing methods of estimating IF predict the leave-one-out-and-retrain effect poorly. To understand the mismatch between the theoretical promise and the practical results, we analyse five assumptions made by IF methods that are problematic for modern-scale deep neural networks and that concern convexity, numerical stability, training trajectory and parameter divergence. This allows us to clarify what can be expected theoretically from IF. We show that while most assumptions can be addressed successfully, parameter divergence poses a clear limitation on the predictive power of IF: influence fades over training time even with deterministic training. We illustrate this theoretical result with BERT and ResNet models. Another conclusion from the theoretical analysis is that IF are still useful for model debugging and correction even though some of the assumptions made in prior work do not hold: using natural language processing and computer vision tasks, we verify that mis-predictions can be successfully corrected by taking only a few fine-tuning steps on influential examples.
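To make the leave-one-out-and-retrain comparison concrete, the following is a minimal sketch in a toy setting where the convexity assumption does hold: ridge regression fitted in closed form. It is not the experimental setup of this work. The data, the sizes, the shifted test label and all names (`fit`, `loss_on_test_point`, `predicted`, `actual`) are assumptions made purely for illustration. The sketch computes the standard first-order IF estimate of the change in test loss when one training example is removed, and compares it with the change measured by actually removing that example and refitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1           # toy sizes and ridge strength (assumptions)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# A test point whose label is deliberately shifted, so the model mispredicts it.
x_test = rng.normal(size=d)
y_test = x_test @ w_true + 1.0


def fit(X, y):
    """Exact minimiser of sum_i 0.5*(x_i.w - y_i)^2 + 0.5*lam*||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def loss_on_test_point(w):
    """Squared-error loss of parameters w on the single test point."""
    return 0.5 * (x_test @ w - y_test) ** 2


w = fit(X, y)
H = X.T @ X + lam * np.eye(d)                 # Hessian of the training objective
grad_test = (x_test @ w - y_test) * x_test    # gradient of the test loss at w
grads_train = (X @ w - y)[:, None] * X        # per-example gradients, shape (n, d)

# First-order IF estimate of the test-loss change when example j is removed:
# grad_j^T H^{-1} grad_test, computed for all j at once.
predicted = grads_train @ np.linalg.solve(H, grad_test)

# Ground truth: remove example j, refit from scratch, and measure the change.
actual = np.array([
    loss_on_test_point(fit(np.delete(X, j, axis=0), np.delete(y, j)))
    - loss_on_test_point(w)
    for j in range(n)
])

corr = np.corrcoef(predicted, actual)[0, 1]
print(f"correlation between IF estimate and leave-one-out retraining: {corr:.3f}")
```

In this convex, exactly solved problem the two quantities track each other closely. The assumptions examined in the abstract are the ones that stop this agreement from carrying over directly to deep networks.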