Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about $60\%$ of the full space. In contrast, MLP outputs and residual streams remain much closer to full rank, exhibiting effective ranks of around $90\%$. This striking dimensional discrepancy is observed consistently across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we identify this low-rank structure as a key cause of the prevalent dead-feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs) that initializes feature directions within the active subspace of the activations. Our approach reduces dead features from $87\%$ to below $1\%$ in Attention Output SAEs with 1M features, and can be extended to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.
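To make the effective-dimensionality claim concrete, below is a minimal sketch of one way to estimate the effective rank of attention outputs, assuming a cumulative-energy threshold on the singular-value spectrum; the abstract does not specify the exact metric the paper uses, so the $99\%$ threshold and the function names here are illustrative assumptions.

```python
# Minimal sketch: threshold-based effective rank of an activation matrix.
# Assumption: effective rank = smallest k whose top-k singular values
# capture a fixed fraction of the total squared spectrum (the paper's
# exact definition may differ).
import torch

def effective_rank(acts: torch.Tensor, energy: float = 0.99) -> int:
    """acts: (n_tokens, d_model) activations, e.g. attention-block outputs."""
    acts = acts - acts.mean(dim=0, keepdim=True)      # center the activations
    s = torch.linalg.svdvals(acts)                    # singular values, descending
    cum = torch.cumsum(s**2, dim=0) / (s**2).sum()    # cumulative spectral energy
    return int(torch.searchsorted(cum, energy).item()) + 1

# Hypothetical usage: `attn_out` collected from a forward pass.
# print(effective_rank(attn_out) / attn_out.shape[1])  # ~0.6 for attention,
#                                                      # ~0.9 for MLP/residual
```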
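Similarly, the subspace-constrained initialization can be sketched as follows, under the assumption that the "active subspace" is the span of the top-$k$ principal components of the activations; the function name and the choice of random coefficients are hypothetical, not the authors' exact procedure.

```python
# Minimal sketch: initialize SAE decoder directions inside the rank-k
# active subspace of the activations, so that no feature starts out
# orthogonal to the region where activations actually live.
import torch

def subspace_init_decoder(acts: torch.Tensor, n_features: int, k: int) -> torch.Tensor:
    """Return a (d_model, n_features) decoder whose unit-norm columns
    lie in the rank-k principal subspace of the (centered) activations."""
    acts = acts - acts.mean(dim=0, keepdim=True)
    # Orthonormal basis of the active subspace: top-k right singular vectors.
    _, _, vh = torch.linalg.svd(acts, full_matrices=False)
    basis = vh[:k].T                                   # (d_model, k)
    coeffs = torch.randn(k, n_features)                # random directions in the basis
    dec = basis @ coeffs                               # (d_model, n_features)
    return dec / dec.norm(dim=0, keepdim=True)         # unit-normalize each feature
```

The design intent, per the abstract, is that features initialized this way start aligned with the intrinsic geometry of the activation space, so far fewer of them end up permanently inactive ("dead") during training.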