Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and show that (1) it reproduces on a larger GPT2 model, and (2) it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the processes underlying the two tasks are functionally very similar, with about 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the Colored Objects task and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components.