Influence functions provide a principled method to assess the contribution of individual training samples to a specific target. Yet, their high computational cost limits their application to large-scale models and datasets. Existing influence function approximation methods significantly reduce the computational overhead, but most suffer from inaccurate estimation because the underlying algorithms lack strong convergence guarantees. The family of hyperpower methods is well known for its rigorous convergence guarantees on matrix inverse approximation, but the required matrix multiplications can incur intractable memory and computation costs on large-scale models. We propose HyperINF, an efficient and accurate influence function approximation method that leverages the hyperpower method, specifically Schulz's iterative algorithm. To handle the computation-intensive matrix multiplications, we incorporate the generalized Fisher information matrix (GFIM) as a low-rank approximation of the Hessian, which reduces the memory and computation overheads to constant costs independent of rank on LoRA-tuned models. We first demonstrate the superior accuracy and stability of HyperINF compared to other baselines through a synthetic convergence simulation for matrix inversion. We further validate its efficacy through extensive real-world data attribution tasks, including mislabeled data detection and data selection for LLM and VLM fine-tuning. On LoRA-tuned models, HyperINF achieves superior downstream performance with minimal memory and computational overhead, while other baselines suffer from significant degradation. Our codebase is available at https://github.com/Blackzxy/HyperINF.
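Schulz's iteration, the hyperpower method named above, approximates a matrix inverse using only matrix multiplications and converges quadratically when properly initialized. The sketch below is a minimal NumPy illustration of the core iteration (not the paper's full method, which additionally uses the GFIM low-rank structure); the function name and iteration count are our own choices.

```python
import numpy as np

def schulz_inverse(A, num_iters=30):
    """Approximate A^{-1} via Schulz's iteration: X_{k+1} = X_k (2I - A X_k).

    Converges quadratically when ||I - A X_0|| < 1; the standard
    initialization X_0 = A^T / (||A||_1 * ||A||_inf) guarantees this.
    """
    n = A.shape[0]
    I = np.eye(n)
    # Standard initialization ensuring convergence of the iteration.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(num_iters):
        X = X @ (2 * I - A @ X)
    return X

# Example on a well-conditioned symmetric positive-definite matrix,
# mimicking a damped Hessian as used in influence function computation.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)   # damping keeps A well-conditioned
X = schulz_inverse(A)
print(np.allclose(X @ A, np.eye(5), atol=1e-6))  # → True
```

Because each step involves only matrix products, the iteration is GPU-friendly; the memory cost of the full-matrix form is what motivates the low-rank GFIM approximation on LoRA-tuned models.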