Transcoders Find Interpretable LLM Feature Circuits

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features, such as those found by sparse autoencoders (SAEs), are typically linear combinations of a very large number of neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this, we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the "greater-than circuit" in GPT2-small. Our results suggest that transcoders can be effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits/.
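To make the idea concrete, here is a minimal NumPy sketch of what a transcoder and its attribution factorization might look like. It is not taken from the linked repository. The function names, the single ReLU hidden layer, the L1 sparsity penalty, and the weight layout (`W_enc` of shape `(d_model, n_features)`, `W_dec` of shape `(n_features, d_model)`) are all assumptions made for illustration. The abstract only states that a wider, sparsely activating layer approximates the MLP and that the circuits split into input-dependent and input-invariant terms.

```python
import numpy as np


def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Assumed transcoder: one wide ReLU hidden layer.

    x:     (batch, d_model) input to the original MLP sublayer
    z:     (batch, n_features) sparse feature activations
    y_hat: (batch, d_model) approximation of the MLP sublayer's output
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    y_hat = z @ W_dec + b_dec
    return z, y_hat


def transcoder_loss(y_hat, y_mlp, z, l1_coeff):
    """Assumed objective: squared error against the true MLP output
    (faithfulness) plus an L1 penalty on the activations (sparsity)."""
    faithfulness = np.mean(np.sum((y_hat - y_mlp) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return faithfulness + l1_coeff * sparsity


def feature_attributions(z_early, W_dec_early, w_enc_late):
    """Attribution of each earlier-layer feature to one later-layer feature.

    input_invariant[i] = dot(decoder vector of early feature i,
                             encoder vector of the late feature)
    depends only on weights; z_early is the input-dependent factor.
    """
    input_invariant = W_dec_early @ w_enc_late   # (n_features,)
    attributions = z_early * input_invariant     # (batch, n_features)
    return attributions, input_invariant


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, n_features, batch = 8, 32, 4        # toy sizes, chosen arbitrarily
    x = rng.normal(size=(batch, d_model))
    W_enc = rng.normal(size=(d_model, n_features))
    W_dec = rng.normal(size=(n_features, d_model))
    z, y_hat = transcoder_forward(
        x, W_enc, np.zeros(n_features), W_dec, np.zeros(d_model)
    )
    w_late = rng.normal(size=d_model)            # stand-in later-layer encoder vector
    attr, inv = feature_attributions(z, W_dec, w_late)
    print(attr.shape, inv.shape)
```

Under these assumptions, the attributions sum over features to the later feature's pre-activation contribution from the transcoder output (ignoring biases), because the decoder is linear in `z`. That linearity is what lets the weights-only factor be computed once and reused across inputs.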

