The Softmax attention mechanism in Transformer models is notoriously expensive due to its quadratic complexity, which poses a significant challenge in vision applications. Linear attention offers a far more efficient alternative by reducing the complexity to linear. However, compared to Softmax attention, linear attention often suffers significant performance degradation. Our experiments indicate that this performance drop stems from the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct a rank analysis from two perspectives: the KV buffer and the output features. Consequently, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the Rank-Augmented Vision Linear Transformer (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA. Code will be available at https://github.com/qhfan/RALA.
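The complexity and rank claims above can be illustrated with a minimal NumPy sketch. This is not RALA itself; it contrasts plain Softmax attention with a generic linear attention (using a simple positive feature map, a common choice in prior linear-attention work) and shows why the implicit linear attention map is rank-limited: the KV buffer is only d × d, so the effective attention matrix has rank at most d, while the N × N Softmax attention matrix can reach rank N.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 196, 32  # N tokens (e.g., a 14x14 patch grid), head dimension d
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Softmax attention: materializes an N x N matrix -> O(N^2 d) cost
A = np.exp(Q @ K.T / np.sqrt(d))
A /= A.sum(axis=1, keepdims=True)
out_softmax = A @ V                           # (N, d)

# Linear attention: phi(Q) (phi(K)^T V) -> O(N d^2) cost
phi = lambda x: np.maximum(x, 0.0) + 1e-6     # simple positive feature map (illustrative)
kv = phi(K).T @ V                             # (d, d) "KV buffer"
z = phi(Q) @ phi(K).sum(axis=0)               # per-token normalizer
out_linear = (phi(Q) @ kv) / z[:, None]       # (N, d)

# The implicit linear attention map phi(Q) phi(K)^T has rank <= d << N,
# whereas the Softmax attention matrix A is generically full rank (N).
A_lin = phi(Q) @ phi(K).T
print("softmax map rank:", np.linalg.matrix_rank(A))     # up to N = 196
print("linear  map rank:", np.linalg.matrix_rank(A_lin)) # at most d = 32
```

The rank gap printed at the end is the "low-rank dilemma" the abstract refers to: no matter how many tokens N there are, the linear attention map cannot exceed rank d, limiting how much spatial structure it can express.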