Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and show (1) that it reproduces on a larger GPT2 model, and (2) that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the processes underlying the two tasks are functionally very similar, with about 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy on the Colored Objects task from 49.6% to 93.7% and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable, task-general algorithmic building blocks and computational components.