Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, SAEs are usually trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer. Given that the residual stream is understood to preserve information across layers, we expected MLSAE latents to 'switch on' at a token position and remain active at later layers. Interestingly, we find that individual latents are often active at a single layer for a given token or prompt, but this layer may differ for different tokens or prompts. We quantify these phenomena by defining a distribution over layers and considering its variance. We find that the variance of the distribution of latent activations over layers is about two orders of magnitude greater when aggregating over tokens than for a single token. For larger underlying models, the degree to which latents are active at multiple layers increases, which is consistent with the fact that the residual stream activation vectors at adjacent layers become more similar. Finally, we relax the assumption that the residual stream basis is the same at every layer by applying pre-trained tuned-lens transformations, but our findings remain qualitatively similar. Our results represent a new approach to understanding how representations change as they flow through transformers. We release our code to train and analyze MLSAEs at https://github.com/tim-lawson/mlsae.
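The two core ideas above can be sketched in code: a single SAE whose encoder and decoder are shared across activations from every layer, and a per-latent variance of a distribution over layers. This is a minimal illustration, not the paper's exact implementation; the top-k sparsity mechanism, dimensions, and function names here are assumptions.

```python
import torch
import torch.nn as nn


class MLSAE(nn.Module):
    """Sketch of a multi-layer SAE: one encoder/decoder pair applied to
    residual-stream activations from any transformer layer. Top-k sparsity
    is an assumption; the paper's architecture may differ in detail."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.k = k  # number of latents kept active per activation vector

    def forward(self, x: torch.Tensor):
        # x: residual-stream activations from any layer, shape (..., d_model)
        pre = self.encoder(x)
        # keep the k largest latents per vector, zero out the rest
        vals, idx = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter(-1, idx, vals)
        return self.decoder(latents), latents


def layer_variance(latent_acts: torch.Tensor) -> torch.Tensor:
    """Variance of a distribution over layers, per latent.
    latent_acts: nonnegative activations, shape (n_layers, n_latents).
    Normalizing each column over layers gives p(layer | latent);
    returns Var[layer] under that distribution, one value per latent."""
    n_layers = latent_acts.shape[0]
    layers = torch.arange(n_layers, dtype=latent_acts.dtype).unsqueeze(-1)
    p = latent_acts / latent_acts.sum(dim=0, keepdim=True).clamp_min(1e-9)
    mean = (p * layers).sum(dim=0)
    return (p * (layers - mean) ** 2).sum(dim=0)
```

Under this definition, a latent that fires at exactly one layer has zero variance, while one spread uniformly over layers has maximal variance, which matches how the abstract contrasts single-token and aggregated-over-tokens behavior.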