Towards Automated Circuit Discovery for Mechanistic Interpretability

Recent work in mechanistic interpretability has reverse-engineered nontrivial behaviors of transformer models. These contributions required considerable effort and researcher intuition, which makes it difficult to apply the same methods to understand the complex behavior that current models display. At its core, however, the workflow behind these discoveries is surprisingly similar. Researchers create a data set and metric that elicit the desired model behavior, subdivide the network into appropriate abstract units, replace activations of those units to identify which are involved in the behavior, and then interpret the functions that these units implement. By varying the data set, metric, and units under investigation, researchers can understand the functionality of each neural network region and the circuits they compose. This work proposes a novel algorithm, Automatic Circuit DisCovery (ACDC), to automate the identification of the important units in the network. Given a model's computational graph, ACDC finds subgraphs that explain a behavior of the model. ACDC reproduced a previously identified circuit for Python docstrings in a small transformer, identifying 6 of the 7 important attention heads, which compose up to 3 layers deep, while including 91% fewer connections.
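The idea of finding a subgraph that explains a behavior can be illustrated with a toy sketch. It assumes a greedy procedure that drops an edge whenever removing it changes the metric by less than a threshold; the abstract does not specify ACDC's actual procedure, so this is an illustration rather than the authors' algorithm. The names `prune_edges`, `toy_metric`, the edge weights, and the threshold are all hypothetical, and removing an edge here stands in for replacing the corresponding activation in a real model.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]  # (source unit, destination unit)


def prune_edges(
    edges: List[Edge],
    metric: Callable[[FrozenSet[Edge]], float],
    threshold: float,
) -> Set[Edge]:
    """Greedily drop edges whose removal barely changes the metric.

    Assumption: this is a simplified greedy search, not the published
    ACDC procedure.
    """
    kept: Set[Edge] = set(edges)
    for edge in edges:  # fixed order keeps the result deterministic
        candidate = kept - {edge}
        change = abs(metric(frozenset(kept)) - metric(frozenset(candidate)))
        if change < threshold:
            kept = candidate  # edge is unimportant for this behavior
    return kept


# Hypothetical per-edge contributions to the behavior under study.
WEIGHTS = {
    ("head_a", "out"): 1.0,
    ("head_b", "out"): 0.001,
    ("head_c", "out"): 0.5,
}


def toy_metric(active: FrozenSet[Edge]) -> float:
    """Stand-in for a task metric evaluated on the ablated model."""
    return sum(WEIGHTS[e] for e in active)


circuit = prune_edges(list(WEIGHTS), toy_metric, threshold=0.01)
print(sorted(circuit))
```

In this toy setting only the edge from `head_b` falls below the threshold, so the recovered circuit keeps the other two edges. In a real model the metric would come from running the network on the chosen data set with the candidate edges ablated.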

