Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. In particular, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention makes the model's output behave as if the value of a feature were changed, this effect may be achieved by activating a dormant parallel pathway that leverages another subspace which is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example and in two real-world domains (the indirect object identification task and factual recall), and we present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. However, this does not imply that activation patching of subspaces is intrinsically unfit for interpretability. To contextualize our findings, we also show what a success case looks like in a task (indirect object identification) where prior manual circuit analysis informs an understanding of the location of a feature. Finally, we explore the additional evidence needed to argue that a patched subspace is faithful.
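To make the described effect concrete, the following is a minimal sketch of a three-neuron linear toy model in which the illusion can arise. The weights, function names, and inputs are assumptions chosen for illustration and are not taken from the text above: neuron 0 is "disconnected" (it varies with the input but has zero output weight), neuron 1 is "dormant" (it is always zero on real inputs but has a nonzero output weight), and neuron 2 is the causal pathway.

```python
import numpy as np

# Hypothetical toy weights (an assumption for illustration only).
# hidden = x * W_IN, output = hidden @ W_OUT, so the model computes output == x.
W_IN = np.array([1.0, 0.0, 1.0])   # neuron 1 never activates on real inputs (dormant)
W_OUT = np.array([0.0, 2.0, 1.0])  # neuron 0 has no effect on the output (disconnected)


def hidden(x: float) -> np.ndarray:
    """Hidden activations of the toy model for scalar input x."""
    return x * W_IN


def output(h: np.ndarray) -> float:
    """Scalar model output computed from hidden activations."""
    return float(h @ W_OUT)


def patch_subspace(h_base: np.ndarray, h_source: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Replace the component of h_base along direction v with that of h_source."""
    v = v / np.linalg.norm(v)
    return h_base + ((h_source - h_base) @ v) * v


x_base, x_source = 1.0, 3.0
h_base, h_source = hidden(x_base), hidden(x_source)

causal_dir = np.array([0.0, 0.0, 1.0])        # the pathway the model actually uses
disconnected_dir = np.array([1.0, 0.0, 0.0])  # correlated with x, ignored by the output
dormant_dir = np.array([0.0, 1.0, 0.0])       # affects the output, but always zero on data
illusory_dir = disconnected_dir + dormant_dir # mixes the two directions above

h_causal = patch_subspace(h_base, h_source, causal_dir)
h_illusory = patch_subspace(h_base, h_source, illusory_dir)

out_causal = output(h_causal)
out_illusory = output(h_illusory)
out_disconnected = output(patch_subspace(h_base, h_source, disconnected_dir))
out_dormant = output(patch_subspace(h_base, h_source, dormant_dir))

print("base output:", output(h_base))
print("patch causal direction:", out_causal)
print("patch illusory direction:", out_illusory)
print("patch disconnected direction only:", out_disconnected)
print("patch dormant direction only:", out_dormant)
print("dormant neuron after illusory patch:", h_illusory[1])
```

In this sketch, patching along the illusory direction moves the output from the base value to the source value, just as patching the causal direction does, even though neither the disconnected nor the dormant direction changes the output when patched on its own. The illusory patch works only by pushing the dormant neuron to a nonzero value that never occurs on real inputs, which is one signal that a patched subspace may not be faithful.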