Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that here, too, SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families, at scales up to 2B parameters. We perform a qualitative study of the features computed by attention layers and find multiple families: long-range context, short-range context, and induction features. We qualitatively study the role of every head in GPT-2 Small and estimate that at least 90% of the heads are polysemantic, i.e. have multiple unrelated roles. Further, we show that SAEs are a useful tool that enables researchers to explain model behavior in greater detail than prior work. For example, we explore the mystery of why models have so many seemingly redundant induction heads, use SAEs to motivate the hypothesis that some are long-prefix whereas others are short-prefix, and confirm this with more rigorous analysis. We use our SAEs to analyze the computation performed by the Indirect Object Identification circuit (Wang et al.), validating that the SAEs find causally meaningful intermediate variables and deepening our understanding of the semantics of the circuit. We open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention Output SAEs.
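The central object is a sparse autoencoder trained to reconstruct attention layer outputs under a sparsity penalty. As a minimal sketch (not the paper's implementation; dimensions, initialization, and the L1 coefficient below are illustrative placeholders), the standard SAE forward pass and training loss look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: real SAEs are typically much wider than the model dimension.
d_model, d_sae = 64, 256
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """x: attention layer output activations, shape (batch, d_model)."""
    # Sparse feature activations: ReLU keeps only a few features active per input.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Reconstruction of the original activation from the active features.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    f, x_hat = sae_forward(x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))   # reconstruction MSE
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))       # L1 penalty encouraging sparse f
    return recon + l1_coeff * sparsity

x = rng.normal(size=(8, d_model))
f, x_hat = sae_forward(x)
```

The learned feature activations `f` are the sparse, (hopefully) interpretable decomposition; each column of `W_dec` is the direction in activation space that one feature writes to the attention output.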