Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet identifying which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. By analyzing the linearized relation of the final hidden state with respect to the inputs, Jacobian Scopes quantify how much each input token influences the model's prediction. We introduce three variants (Semantic, Fisher, and Temperature Scopes) that respectively target the sensitivity of specific logits, the full predictive distribution, and model confidence (inverse temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we uncover notable findings, such as cases in which Jacobian Scopes surface implicit political biases. We believe our proposed methods also shed light on recently debated mechanisms underlying in-context time-series forecasting. Our code and interactive demonstrations are publicly available at https://github.com/AntonioLiu97/JacobianScopes.
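To make the core idea concrete, the following is a minimal sketch, not the paper's implementation: it illustrates a Semantic-Scope-style attribution on a hypothetical toy model (random weights, mean-free token mixing) by numerically approximating the Jacobian of one target logit with respect to each input token's embedding and scoring each token by the norm of its gradient block. All names (`final_hidden`, `semantic_scope`, the toy weight matrices) are invented for illustration.

```python
import numpy as np

# Hypothetical toy model: input token embeddings are flattened, mixed by a
# random linear map, passed through a tanh nonlinearity (stand-in for the
# transformer's final hidden state), then projected to vocabulary logits.
rng = np.random.default_rng(0)
d, vocab, n_tokens = 8, 5, 4
W_in = rng.normal(size=(d, n_tokens * d))   # token-mixing weights
W_out = rng.normal(size=(vocab, d))         # unembedding / logit projection
X = rng.normal(size=(n_tokens, d))          # input token embeddings

def final_hidden(X):
    # stand-in for the model's final hidden state
    return np.tanh(W_in @ X.ravel())

def logit(X, target):
    # the specific logit whose sensitivity we attribute to input tokens
    return W_out[target] @ final_hidden(X)

def semantic_scope(X, target, eps=1e-5):
    """Finite-difference approximation of the Jacobian of one logit with
    respect to each input token embedding; each token's influence score is
    the L2 norm of its block of the gradient."""
    base = logit(X, target)
    scores = np.zeros(len(X))
    for t in range(len(X)):
        grad = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            Xp = X.copy()
            Xp[t, j] += eps
            grad[j] = (logit(Xp, target) - base) / eps
        scores[t] = np.linalg.norm(grad)
    return scores

scores = semantic_scope(X, target=2)
print(scores)  # one non-negative influence score per input token
```

In an actual LLM one would compute these gradients with automatic differentiation rather than finite differences, but the attribution principle (per-token norms of a Jacobian of the output with respect to the inputs) is the same.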