Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism captures token interactions within sequences through the softmax function. Linear attention offers a more computationally efficient alternative by approximating the softmax operation with linear complexity, but it suffers substantial performance degradation compared to standard softmax attention. In this paper, we address the gap in our theoretical understanding of why this practical performance difference arises. Through a comprehensive comparative analysis of the two attention mechanisms, we shed light on the underlying reasons why softmax attention outperforms linear attention in most scenarios.
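To make the complexity contrast concrete, the following is a minimal sketch of the two mechanisms in NumPy. The feature map `phi` (a ReLU-based map with a small offset) is an illustrative assumption, not the specific kernel analyzed here; the point is only that reassociating the matrix products lets linear attention avoid the explicit (n x n) score matrix.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the score matrix is (n x n), so cost grows
    # quadratically with the sequence length n.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (n, d_v)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear attention replaces softmax(QK^T) with phi(Q) phi(K)^T for a
    # nonnegative feature map phi, so the product can be reassociated:
    #   (phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V),
    # costing O(n * d * d_v) instead of O(n^2 * d).
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                       # (d, d_v)
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T            # (n, 1) normalizer
    return (Qp @ KV) / Z                                # (n, d_v)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```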