Large language models have achieved remarkable success in recent years, primarily due to the self-attention mechanism. However, traditional Softmax attention suffers from numerical instability and degraded performance as the inference token length increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm, identifying the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism that outperforms conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed mechanism, we introduce a fine-tuning-free re-weighting scheme that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens without retraining. Combined with our proposed attention mechanism, this approach shows significant promise for longer sequences, maintaining nearly constant validation loss even at 16$\times$ the training token length while ensuring numerical stability. Our code is available at: https://github.com/iminfine/freeatten.
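The core idea above can be sketched in a few lines: replace the exponential inside Softmax with Softplus, keep the $l_1$-normalization step, and scale scores by a factor that grows logarithmically with sequence length so attention entropy stays roughly invariant. This is a minimal illustration under assumed conventions; the scaling constant `gamma`, the function names, and the exact placement of the length-dependent factor are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)), never overflows.
    return np.logaddexp(0.0, x)

def softplus_attention(q, k, v, gamma=1.0):
    """Sketch of Softplus attention with l1-normalization.

    q: (m, d) query vectors, k: (n, d) keys, v: (n, d_v) values.
    `gamma` and the log(n) scale factor are illustrative assumptions
    motivated by the entropy-invariance argument in the abstract.
    """
    d = q.shape[-1]
    n = k.shape[0]
    # Dynamic scale factor: grows with log(sequence length) so the
    # attention distribution does not flatten as n increases.
    scale = gamma * np.log(n) / np.sqrt(d)
    scores = softplus(q @ k.T * scale)          # non-negative scores
    weights = scores / scores.sum(axis=-1, keepdims=True)  # l1-norm
    return weights @ v
```

Because Softplus is strictly positive, the row sums are never zero, so the $l_1$-normalization is always well defined, unlike the exponential, whose overflow for large scores is one source of the numerical instability the abstract mentions.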