Large language models have achieved remarkable success in recent years, primarily due to self-attention mechanisms. However, conventional Softmax attention suffers from numerical instability and degraded performance as the number of inference tokens grows. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and an $l_1$-norm normalization, and we identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic length scale factor, derived from invariance entropy, for different token lengths, we obtain a novel attention mechanism that outperforms conventional Softmax attention across a range of inference lengths. To further improve the length extrapolation ability of this mechanism, we introduce a re-weighting step that amplifies strong attention weights while suppressing weak ones, enabling the model to concentrate more effectively on relevant tokens. Combined with our proposed attention mechanism, this approach shows significant promise for long sequences, maintaining nearly constant validation loss at 16$\times$ the training token length while remaining numerically stable. Our code is available at: https://github.com/iminfine/freeatten.
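The decomposition described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: `softplus` stands in for the exponential inside Softmax, `gamma` is a placeholder for the dynamic length scale factor (which the paper derives from invariance entropy per sequence length), and the power `p` in the re-weighting step is a hypothetical choice for amplifying strong weights while suppressing weak ones.

```python
import numpy as np

def softplus_attention_weights(scores, gamma=1.0, p=2.0):
    """Sketch of Softmax decomposed into a non-linearity plus l1 normalization.

    scores: attention logits of shape (..., seq_len)
    gamma:  placeholder length-scale factor (dynamic in the paper)
    p:      hypothetical re-weighting exponent (> 1 sharpens the distribution)
    """
    # Non-linear transformation: Softplus replaces exp, computed stably.
    sp = np.logaddexp(0.0, gamma * scores)  # softplus(x) = log(1 + e^x)
    # l1-norm normalization, the component identified as essential.
    w = sp / sp.sum(axis=-1, keepdims=True)
    # Re-weighting: amplify large weights, diminish small ones, renormalize.
    w_p = w ** p
    return w_p / w_p.sum(axis=-1, keepdims=True)
```

Because Softplus is bounded below by zero and grows only linearly for large inputs, the weights stay finite and non-negative for any sequence length, which is the numerical-stability property the abstract refers to.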