Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims, i.e., claims about what the model output would have been for the same prompt under an unobserved intervention, remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these motivate a diagnostic framework that helps practitioners select methods and evaluations that match claims to evidence, so that findings generalise.