The quadratic computational complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention offers a much more efficient alternative: it achieves linear complexity by approximating the Softmax operation through carefully designed mapping functions. However, current linear attention approaches either suffer from significant performance degradation or introduce additional computational overhead from the mapping functions. In this paper, we propose a novel Focused Linear Attention module that achieves both high efficiency and expressiveness. Specifically, we first analyze the factors contributing to the performance degradation of linear attention from two perspectives: focus ability and feature diversity. To overcome these limitations, we introduce a simple yet effective mapping function and an efficient rank restoration module that enhance the expressiveness of self-attention while maintaining low computational complexity. Extensive experiments show that our linear attention module is applicable to a variety of advanced vision Transformers and achieves consistently improved performance on multiple benchmarks. Code is available at https://github.com/LeapLabTHU/FLatten-Transformer.
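To make the complexity argument concrete, the following is a minimal NumPy sketch of generic linear attention, not the Focused Linear Attention module itself. The abstract does not specify the proposed mapping function, so `feature_map` below is a hypothetical placeholder (a ReLU plus a small constant), and all function names are illustrative assumptions. The sketch only shows how replacing Softmax with a mapping function lets the matrix products be reordered so that no N x N matrix is formed.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard attention: forms an N x N matrix, so cost grows as O(N^2 * d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def feature_map(x):
    """Hypothetical non-negative mapping (assumption, not the paper's function)."""
    return np.maximum(x, 0.0) + 1e-6


def linear_attention_quadratic(q, k, v):
    """Reference form: same similarity as below, but materializes N x N."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    sim = phi_q @ phi_k.T  # shape (N, N)
    return (sim / sim.sum(axis=-1, keepdims=True)) @ v


def linear_attention(q, k, v):
    """Reordered form: computes phi(K)^T V first, so cost grows as O(N * d^2)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v           # shape (d, d_v), independent of N in size
    k_sum = phi_k.sum(axis=0)  # shape (d,), used for row normalization
    numer = phi_q @ kv         # shape (N, d_v)
    denom = phi_q @ k_sum      # shape (N,)
    return numer / denom[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, dim = 196, 32  # illustrative sizes only
    q = rng.standard_normal((n_tokens, dim))
    k = rng.standard_normal((n_tokens, dim))
    v = rng.standard_normal((n_tokens, dim))
    out = linear_attention(q, k, v)
    ref = linear_attention_quadratic(q, k, v)
    print("outputs match:", np.allclose(out, ref))
```

Both linear variants produce the same output; only the order of multiplication differs. Note also that the implicit similarity matrix `phi_q @ phi_k.T` has rank at most `d`, which is one plausible reading of why feature diversity is a concern for linear attention.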