Circuit Component Reuse Across Tasks in Transformer Language Models

Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and (1) show that it reproduces on a larger GPT2 model and (2) show that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the processes underlying the two tasks are functionally very similar, with about 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy on the Colored Objects task from 49.6% to 93.7% and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable, task-general algorithmic building blocks and computational components.
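To make the head-overlap figure concrete, here is a minimal sketch of one plausible way to measure overlap between two circuits, each treated as a set of (layer, head) pairs. The abstract does not state the exact definition used, so the directional ratio below is an assumption, and the function name `head_overlap` and both head sets are hypothetical, invented purely for illustration.

```python
from typing import Set, Tuple

# A head is identified by (layer index, head index).
Head = Tuple[int, int]


def head_overlap(circuit_a: Set[Head], circuit_b: Set[Head]) -> float:
    """Return the fraction of heads in circuit_a that also appear in circuit_b.

    Assumption: overlap is measured relative to circuit_a. Other definitions
    (for example, intersection over union) are equally possible.
    """
    if not circuit_a:
        raise ValueError("circuit_a must contain at least one head")
    return len(circuit_a & circuit_b) / len(circuit_a)


# Hypothetical head sets, NOT the heads reported in the paper.
ioi_heads: Set[Head] = {(9, 3), (11, 8), (15, 14), (16, 15), (18, 5)}
colored_objects_heads: Set[Head] = {(9, 3), (11, 8), (15, 14), (16, 15), (17, 4)}

overlap = head_overlap(colored_objects_heads, ioi_heads)
print(f"Share of Colored Objects heads also in the IOI circuit: {overlap:.0%}")
```

With these made-up sets, four of the five Colored Objects heads are shared with the IOI set. Because the ratio is directional, swapping the arguments can give a different value when the two circuits differ in size.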

