A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of a model that correspond to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features, such as those found by sparse autoencoders (SAEs), are typically linear combinations of a very large number of neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this, we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the "greater-than circuit" in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits/.
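
To make the idea of a transcoder more concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes a ReLU encoder, a linear decoder, and an L1 sparsity penalty, as is common for SAEs; the abstract does not specify these details. All function names (`transcoder_forward`, `transcoder_loss`, `feature_to_feature_attribution`) are hypothetical. The last function illustrates one plausible way a circuit term could factorize into an input-dependent part (a feature's activation) and an input-invariant part (a dot product of fixed weights).

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Map an MLP *input* x to an approximation of the MLP *output*.

    Assumed shapes: W_enc is (n_features x d_model), W_dec is
    (d_model x n_features), with n_features larger than d_model in practice.
    """
    z = relu(add(matvec(W_enc, x), b_enc))   # sparse feature activations
    y_hat = add(matvec(W_dec, z), b_dec)     # approximated MLP output
    return z, y_hat

def transcoder_loss(y, y_hat, z, l1_coeff):
    """Assumed objective: squared error (faithfulness) + L1 penalty (sparsity)."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    l1 = sum(abs(a) for a in z)
    return mse + l1_coeff * l1

def feature_to_feature_attribution(z_i, dec_i, enc_j):
    """Illustrative factorized circuit term (an assumption, not the paper's API).

    z_i    : activation of an upstream feature on this input (input-dependent)
    dec_i  : that feature's decoder vector, i.e. a column of W_dec
    enc_j  : a downstream feature's encoder vector, i.e. a row of W_enc
    The dot product dec_i . enc_j depends only on weights (input-invariant).
    """
    weight_term = sum(a * b for a, b in zip(dec_i, enc_j))
    return z_i * weight_term

# Tiny hand-built example (2-dimensional model, 2 features).
W_enc = [[1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 1.0], [0.0, 2.0]]
b_dec = [0.5, 0.0]

x = [2.0, 3.0]
z, y_hat = transcoder_forward(x, W_enc, b_enc, W_dec, b_dec)
# Only the first feature fires: z == [2.0, 0.0], y_hat == [2.5, 0.0]
```

In this toy case the transcoder reads the MLP input and writes an approximation of the MLP output, whereas an SAE would reconstruct the same activation it reads. Training would minimize `transcoder_loss` against the true MLP output `y`; the optimizer and hyperparameters are left out because the abstract does not describe them.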