Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), in which initial tokens receive a disproportionate share of attention scores, its structural origins remain elusive. This work provides a \textit{mechanistic explanation} for the phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy across positions. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity in the first-token representation, necessitating the formation of attention sinks as a structural anchor. We then validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions replicate attention sinks at arbitrary positions. This mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose \textit{head-wise RMSNorm}, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
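To make the variance argument concrete, the following is a minimal sketch of how causal value aggregation can produce a position-dependent scale, and of how a per-head normalization could remove it. It is an illustration under simplifying assumptions, not the implementation used in this work: attention weights are taken to be uniform over the causal prefix, value vectors are i.i.d. standard Gaussian, and the hypothetical helper \texttt{headwise\_rmsnorm} applies RMS normalization to each head's slice independently with no learnable gain. All function and variable names are illustrative.

```python
import math
import random


def causal_uniform_aggregate(values):
    """Return, for each position t, the mean of values[0..t].

    This models causal attention with uniform weights (an assumption):
    position 0 sees only itself, while position t averages t + 1 vectors.
    """
    dim = len(values[0])
    running = [0.0] * dim
    outputs = []
    for t, v in enumerate(values):
        running = [r + x for r, x in zip(running, v)]
        outputs.append([r / (t + 1) for r in running])
    return outputs


def rms(vec):
    """Root mean square of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def headwise_rmsnorm(vec, num_heads, eps=1e-6):
    """Rescale each head's slice of vec to (approximately) unit RMS.

    Hypothetical formulation: no learnable gain, one scale per head.
    """
    head_dim = len(vec) // num_heads
    out = []
    for h in range(num_heads):
        head = vec[h * head_dim:(h + 1) * head_dim]
        mean_sq = sum(x * x for x in head) / head_dim
        scale = 1.0 / math.sqrt(mean_sq + eps)
        out.extend(x * scale for x in head)
    return out


rng = random.Random(0)
seq_len, num_heads, head_dim = 16, 4, 64

# i.i.d. unit-variance value vectors, one per position (an assumption).
values = [
    [rng.gauss(0.0, 1.0) for _ in range(num_heads * head_dim)]
    for _ in range(seq_len)
]

aggregated = causal_uniform_aggregate(values)

# Before normalization: RMS shrinks roughly like 1 / sqrt(t + 1).
rms_before = [rms(o) for o in aggregated]

# After per-head normalization: every position has RMS close to 1.
rms_after = [rms(headwise_rmsnorm(o, num_heads)) for o in aggregated]

for t in (0, 1, 3, 15):
    print(f"t={t:2d}  before={rms_before[t]:.3f}  after={rms_after[t]:.3f}")
```

Under these assumptions the first position keeps the full scale of a single value vector while later positions average it away, which is the kind of statistical disparity the abstract attributes to value aggregation. Normalizing each head's output equalizes the scale across positions. Trained models have non-uniform attention and correlated values, so the exact decay will differ.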