Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, conventional Softmax attention suffers from numerical instability and degraded performance as the number of inference tokens grows. This work addresses these issues by proposing a new design principle for attention that views it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation; we further introduce a dynamic scale factor based on entropy invariance. We show that this attention mechanism outperforms both conventional Softmax attention and state-of-the-art Softmax-free alternatives. Our second proposal introduces a sharpening stage: a re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigates the attention-sink phenomenon, and fundamentally improves length extrapolation. This two-stage replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and downstream benchmarks. Finally, symbolic regression experiments demonstrate that our method enables models to recover Newton's law of gravitation from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial for foundation models to develop genuine physical world models.
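To make the two stages concrete, the following is a minimal illustrative sketch in our own notation rather than a verbatim statement of the method: $\lambda_{n}$ denotes the entropy-based dynamic scale factor (its exact form is left unspecified here), and the sharpening stage is shown as an assumed power re-weighting with exponent $p > 1$.
\[
a_{ij} = \frac{\operatorname{softplus}\!\left(\lambda_{n}\, q_i^{\top} k_j\right)}{\sum_{m=1}^{n} \operatorname{softplus}\!\left(\lambda_{n}\, q_i^{\top} k_m\right)},
\qquad
\tilde{a}_{ij} = \frac{a_{ij}^{\,p}}{\sum_{m=1}^{n} a_{im}^{\,p}},
\qquad
o_i = \sum_{j=1}^{n} \tilde{a}_{ij}\, v_j .
\]
The first expression is the normalisation stage (Softplus scores followed by $l_{1}$-normalisation), the second is the sharpening re-weighting, which amplifies large attention weights and suppresses small ones before re-normalising, and the third is the usual aggregation of values.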