Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
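The role softmax plays in attention sinks can be illustrated with a minimal numpy sketch (an illustrative toy, not the paper's implementation): because softmax forces each query's attention weights to sum to one, a query with no strongly relevant key must still place its mass somewhere, whereas the ReLU replacement mentioned above removes that constraint.

```python
import numpy as np

def softmax_attention(scores):
    # Softmax normalizes each row of attention scores to sum to 1,
    # so even a query with uniformly weak scores must dump its
    # probability mass somewhere -- often on a "sink" token.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relu_attention(scores):
    # ReLU attention drops the sum-to-one constraint: irrelevant
    # keys can receive exactly zero weight, with no forced sink.
    return np.maximum(scores, 0.0)

# A query weakly matched to every key: softmax still normalizes,
# while ReLU can leave all weights at zero.
scores = np.array([[-4.0, -5.0, -5.0, -5.0]])
sm = softmax_attention(scores)
rl = relu_attention(scores)
print(sm.round(3))  # mass concentrates on the first position
print(rl)           # all zeros: no attention is forced anywhere
```

This is only a caricature of the normalization constraint; the paper's analysis concerns trained transformers, where the active-dormant and mutual reinforcement mechanisms determine *which* token becomes the sink.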