Studying Large Language Model Generalization with Influence Functions

Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman

From arXiv: 119 pages, 47 figures, 22 tables

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
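To make the counterfactual concrete, here is a minimal sketch of an influence computation on a toy ridge-regression model, where the Hessian is small enough to invert exactly. It is an illustration only, not the EK-FAC procedure described above: the model, the damping value `lam`, the data, and the function names `fit` and `influence` are all assumptions made for this example. The sketch estimates how a query prediction would change if a candidate example were upweighted by a small amount `eps`, using the standard first-order formula (negative query gradient times inverse Hessian times training gradient), and compares that estimate against actually refitting the model.

```python
import numpy as np

def fit(X, y, lam, extra=None, eps=0.0):
    """Minimize (1/2N) * sum_i (x_i . theta - y_i)^2 + (lam/2) * ||theta||^2,
    optionally plus eps * 0.5 * (x_m . theta - y_m)^2 for one extra example."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    if extra is not None:
        x_m, y_m = extra
        A = A + eps * np.outer(x_m, x_m)
        b = b + eps * y_m * x_m
    return np.linalg.solve(A, b)

def influence(X, y, lam, x_m, y_m, x_q):
    """First-order estimate of d f / d eps at eps = 0, where
    f(theta) = x_q . theta is the query prediction."""
    n, d = X.shape
    theta = fit(X, y, lam)
    H = X.T @ X / n + lam * np.eye(d)       # exact (damped) Hessian of the toy objective
    grad_train = (x_m @ theta - y_m) * x_m  # gradient of the candidate example's loss
    grad_query = x_q                        # gradient of f with respect to theta
    ihvp = np.linalg.solve(H, grad_query)   # inverse-Hessian-vector product (IHVP)
    return -ihvp @ grad_train

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
x_m, y_m = rng.normal(size=3), 1.0  # candidate training example
x_q = rng.normal(size=3)            # query input
lam = 0.1

predicted = influence(X, y, lam, x_m, y_m, x_q)

# Ground truth by refitting with the candidate example upweighted by a small eps.
eps = 1e-6
actual = (x_q @ fit(X, y, lam, (x_m, y_m), eps) - x_q @ fit(X, y, lam)) / eps

print(f"influence estimate: {predicted:.6f}")
print(f"finite difference:  {actual:.6f}")
```

In this toy setting the `np.linalg.solve(H, grad_query)` call is trivial. For an LLM the same IHVP involves a matrix far too large to form or factor directly, which is the step the abstract says EK-FAC approximates.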
