Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the softmax function. In contrast, linear attention offers a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation compared to traditional softmax attention. In this paper, we narrow the gap in the theoretical understanding of the practical performance difference between softmax and linear attention. Through a comprehensive comparative analysis of the two attention mechanisms, we shed light on why softmax attention outperforms linear attention in most scenarios.
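To make the contrast between the two mechanisms concrete, the following is a minimal NumPy sketch of single-head softmax attention next to one common kernel-style form of linear attention. It is an illustration under stated assumptions, not necessarily the formulation analyzed in the paper: the function names, the `1/sqrt(d)` scaling, and the `ELU(x) + 1` feature map are all choices made here for the example.

```python
import numpy as np


def softmax_attention(Q, K, V):
    """Single-head softmax attention; builds an n x n score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, n): quadratic in n
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V


def feature_map(X):
    """Assumed positive feature map, ELU(x) + 1 (an illustrative choice)."""
    return np.where(X > 0, X + 1.0, np.exp(X))


def linear_attention(Q, K, V):
    """Kernel-style linear attention; never forms the n x n matrix."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                  # (d, d_v) summary, computed once
    z = Kf.sum(axis=0)             # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]   # cost linear in n


rng = np.random.default_rng(0)
n, d, d_v = 6, 4, 3
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))

out_softmax = softmax_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
```

Both functions return an `(n, d_v)` array whose rows are weighted averages of the rows of `V`. The difference lies in how the weights are produced: the softmax version exponentiates pairwise scores, while the linear version factorizes the weights through the feature map so that the key-value product can be computed once and reused for every query.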