Recent work in mechanistic interpretability has reverse-engineered nontrivial behaviors of transformer models. These contributions required considerable effort and researcher intuition, which makes it difficult to apply the same methods to the complex behavior that current models display. At its core, however, the workflow behind these discoveries is surprisingly similar. Researchers create a data set and metric that elicit the desired model behavior, subdivide the network into appropriate abstract units, replace activations of those units to identify which are involved in the behavior, and then interpret the functions that these units implement. By varying the data set, metric, and units under investigation, researchers can understand the functionality of each neural network region and the circuits they compose. This work proposes a novel algorithm, Automatic Circuit DisCovery (ACDC), to automate the identification of the important units in the network. Given a model's computational graph, ACDC finds subgraphs that explain a behavior of the model. ACDC was able to reproduce a previously identified circuit for Python docstrings in a small transformer, identifying 6 of the 7 important attention heads, which compose up to 3 layers deep, while including 91% fewer connections.
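The workflow described above (replace activations of a unit, then check whether the metric changes) can be illustrated with a toy sketch. The abstract does not specify the algorithm's details, so everything below is an assumption made for illustration: the four-edge linear graph, the names `forward` and `prune_edges`, the absolute-difference metric, and the threshold `tau` are all hypothetical, and this is not the authors' implementation. The sketch greedily removes an edge whenever feeding it a "corrupted" activation changes the output by less than `tau`.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def forward(
    order: List[str],
    weights: Dict[Edge, float],
    inputs: Dict[str, float],
    patched: Optional[Set[Edge]] = None,
    patch_values: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Evaluate a toy linear graph: each non-input node sums weight * parent value.

    Along edges in `patched`, the parent's value is read from `patch_values`
    (activations recorded on a corrupted input) instead of the current run.
    `order` must be a topological ordering of all nodes.
    """
    patched = patched or set()
    values = dict(inputs)
    for node in order:
        if node in values:  # input node, value already given
            continue
        total = 0.0
        for (parent, child), w in weights.items():
            if child != node:
                continue
            source = patch_values if (parent, child) in patched else values
            total += w * source[parent]
        values[node] = total
    return values


def prune_edges(
    order: List[str],
    weights: Dict[Edge, float],
    clean: Dict[str, float],
    corrupt: Dict[str, float],
    output: str,
    tau: float,
) -> Set[Edge]:
    """Greedily drop edges whose patching barely changes the output (hypothetical rule)."""
    corrupt_values = forward(order, weights, corrupt)
    target = forward(order, weights, clean)[output]

    def damage(removed: Set[Edge]) -> float:
        out = forward(order, weights, clean, removed, corrupt_values)[output]
        return abs(out - target)

    removed: Set[Edge] = set()
    current = damage(removed)
    # Visit nodes from the output back toward the inputs.
    for node in reversed(order):
        for edge in [e for e in weights if e[1] == node]:
            candidate = damage(removed | {edge})
            if candidate - current < tau:  # edge contributes little: remove it
                removed.add(edge)
                current = candidate
    return set(weights) - removed


# Toy graph: out = 1.0 * a + 0.01 * b, with a = 2 * x and b = 3 * x.
order = ["x", "a", "b", "out"]
weights = {("x", "a"): 2.0, ("x", "b"): 3.0, ("a", "out"): 1.0, ("b", "out"): 0.01}
circuit = prune_edges(order, weights, clean={"x": 1.0}, corrupt={"x": 0.0},
                      output="out", tau=0.1)
print(sorted(circuit))  # [('a', 'out'), ('x', 'a')]
```

In this toy setting the path through `b` contributes only 0.03 to the output, so both of its edges fall below the threshold and are pruned, leaving the subgraph `x -> a -> out`. A real model would need a task-appropriate metric and a way to intercept activations between components; those parts are outside what this sketch covers.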