Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.
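As a rough illustration of what a gradient-based, token-level attribution score can look like, the sketch below assigns each input token the L2 norm of the derivative of one target logit with respect to that token's embedding. This is an assumption made for illustration, not the definition used by Jacobian Scopes. The model (`toy_logits`), its mixing weights, and the readout matrix are hypothetical stand-ins, and central finite differences replace the automatic differentiation one would use on a real LLM.

```python
import math

def toy_logits(embeddings, mix_weights, readout):
    """Hypothetical toy model: weighted sum of token embeddings -> tanh -> linear readout."""
    dim = len(embeddings[0])
    pooled = [sum(w * e[i] for w, e in zip(mix_weights, embeddings)) for i in range(dim)]
    hidden = [math.tanh(p) for p in pooled]
    return [sum(r * h for r, h in zip(row, hidden)) for row in readout]

def token_influence(embeddings, mix_weights, readout, target, eps=1e-5):
    """Per-token L2 norm of d logit[target] / d embedding[t], via central differences."""
    scores = []
    for t, emb in enumerate(embeddings):
        squared_norm = 0.0
        for i in range(len(emb)):
            # Perturb one coordinate of one token embedding in both directions.
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[t][i] += eps
            minus[t][i] -= eps
            grad = (toy_logits(plus, mix_weights, readout)[target]
                    - toy_logits(minus, mix_weights, readout)[target]) / (2 * eps)
            squared_norm += grad * grad
        scores.append(math.sqrt(squared_norm))
    return scores

# Three "tokens" with 2-dimensional embeddings (illustrative values only).
embeddings = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]]
mix_weights = [0.1, 0.7, 0.2]          # stand-in for how strongly each position is attended to
readout = [[1.0, -0.5], [0.3, 0.8]]    # maps the hidden state to two logits

scores = token_influence(embeddings, mix_weights, readout, target=0)
ranking = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
```

In this toy model the score of each token is proportional to the magnitude of its mixing weight, so the second token ranks highest. Applying the same idea to other targets named above, such as the full predictive distribution or the effective temperature, would require a different scalar or vector output in place of the single logit.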