A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, Yang Xu, Haoran Lian, Siqi Zhang, Rui Men, Jianwei Zhang, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (\textit{e.g.}, softmax attention and RMSNorm), effectively rescale the other, non-outlier components. We term this phenomenon \textit{outlier-driven rescaling} and validate the hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations are: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance, and directly clipping outliers while retaining normalization also degrades performance, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve as rescaling factors rather than as contributors: the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (an average gain of 2 points) and enhanced quantization robustness (only 1.2 points of degradation under W4A4 quantization).
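The rescaling effect described above can be illustrated with a minimal sketch. It assumes a plain RMSNorm without a learned gain and a single softmax attention row; the vectors, the outlier value of 50.0, and the sink logit of 6.0 are hypothetical numbers chosen for illustration and are not taken from the paper.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Residual sink (illustrative values): one dimension carries a large activation.
base = [1.0, -1.0, 0.5, -0.5]
normed_with_sink = rms_norm(base + [50.0])    # hypothetical outlier dimension
normed_without_sink = rms_norm(base + [0.0])  # same vector, outlier removed

# Factor by which the outlier shrinks every non-outlier dimension after RMSNorm.
residual_scale = normed_with_sink[0] / normed_without_sink[0]

# Attention sink (illustrative values): one token receives a large logit.
token_logits = [1.0, 0.5, 0.0]
sink_logit = 6.0                               # hypothetical sink token
attn_with_sink = softmax([sink_logit] + token_logits)
attn_without_sink = softmax(token_logits)

# Total attention mass left for non-sink tokens; it is 1.0 when no sink exists.
non_sink_mass = sum(attn_with_sink[1:])

# The sink rescales non-sink weights uniformly, so their ratios are unchanged.
ratio_with_sink = attn_with_sink[1] / attn_with_sink[2]
ratio_without_sink = attn_without_sink[0] / attn_without_sink[1]

print(f"residual rescale factor: {residual_scale:.4f}")
print(f"non-sink attention mass: {non_sink_mass:.4f}")
```

In both cases the outlier leaves the relative pattern among non-outlier components unchanged and only reduces their overall magnitude. This matches the reading of outliers as rescaling factors rather than contributors, although the sketch does not model training dynamics or the gated rescaling that the abstract mentions.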
