Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.
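As a rough illustration of what a gradient-based, token-level attribution computes, the sketch below scores each input token by the norm of the gradient of a single target logit with respect to that token's embedding. It is not the released Jacobian Scopes implementation: the toy model `toy_logit`, the per-token mixing weights, the readout vector, and the finite-difference gradient are all assumptions made for illustration, standing in for an actual LLM and automatic differentiation.

```python
import math


def toy_logit(embeddings, mix_weights, readout):
    """Hypothetical stand-in for an LLM: maps token embeddings to one target logit.

    Each token's embedding passes through tanh, is scaled by a fixed per-token
    weight (a crude proxy for attention), and is summed into a hidden vector
    that a readout vector projects to a scalar logit.
    """
    hidden = [0.0] * len(readout)
    for x, a in zip(embeddings, mix_weights):
        for j in range(len(readout)):
            hidden[j] += a * math.tanh(x[j])
    return sum(u * h for u, h in zip(readout, hidden))


def token_scores(embeddings, mix_weights, readout, eps=1e-5):
    """Score each token by the L2 norm of d(logit)/d(embedding of that token).

    Gradients are approximated with central finite differences; a real
    implementation would use automatic differentiation instead.
    """
    scores = []
    for t, x in enumerate(embeddings):
        squared_norm = 0.0
        for i in range(len(x)):
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[t][i] += eps
            minus[t][i] -= eps
            grad = (toy_logit(plus, mix_weights, readout)
                    - toy_logit(minus, mix_weights, readout)) / (2 * eps)
            squared_norm += grad * grad
        scores.append(math.sqrt(squared_norm))
    return scores


# Illustrative inputs (made up): three tokens with 2-dimensional embeddings.
embeddings = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
mix_weights = [0.0, 1.0, 2.0]   # token 0 is ignored by the toy model
readout = [3.0, 4.0]

scores = token_scores(embeddings, mix_weights, readout)
most_influential = max(range(len(scores)), key=lambda t: scores[t])
```

In this toy setting the token that the model ignores receives a score of essentially zero, and the score grows with the token's mixing weight, which is the qualitative behavior one would expect from a gradient-norm attribution.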