Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, so such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and show that (1) it reproduces on a larger GPT2 model, and (2) it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the processes underlying the two tasks are functionally very similar, with about a 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in the middle layers to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy on the Colored Objects task from 49.6% to 93.7% and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit's behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable, task-general algorithmic building blocks and computational components.