Vision transformers (ViTs) have pushed the state of the art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has quadratic complexity in both computation and memory usage. This motivates the development of methods that approximate self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximation, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. Preserving the softmax operation challenges any subsequent linearization effort. Based on this insight, a family of Softmax-Free Transformers (SOFT) is proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling the full self-attention matrix to be approximated by low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., to dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, SOFT permits much longer token sequences, resulting in a superior trade-off between accuracy and complexity. Code and models are available at https://github.com/fudan-zvg/SOFT.
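The low-rank approximation and the iterative inverse described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the released SOFT implementation: the function names (`gaussian_kernel`, `newton_pinv`, `low_rank_attention`), the unit kernel bandwidth, the evenly strided choice of `m` landmark tokens, and the initialization of the iteration are all hypothetical choices made for the example. Gradient handling and the symmetric normalization are not shown.

```python
import numpy as np


def gaussian_kernel(x, y):
    """Pairwise Gaussian similarity exp(-||x_i - y_j||^2 / 2) (unit bandwidth assumed)."""
    sq_dist = (x ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2.0 * x @ y.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def newton_pinv(a, iters=20):
    """Approximate the Moore-Penrose inverse with the iteration V <- 2V - V A V.

    The starting point V0 = A^T / (||A||_1 * ||A||_inf) is an assumed, standard
    choice that keeps the iteration convergent.
    """
    v = a.T / (np.linalg.norm(a, 1) * np.linalg.norm(a, np.inf))
    for _ in range(iters):
        v = 2.0 * v - v @ a @ v
    return v


def low_rank_attention(x, values, m, iters=20):
    """Softmax-free attention approximated through m landmark tokens.

    x:      (n, d) token features, used as both queries and keys (assumption).
    values: (n, d_v) value vectors.
    """
    n = x.shape[0]
    # Hypothetical landmark selection: m evenly strided tokens.
    idx = np.linspace(0, n - 1, m).round().astype(int)
    landmarks = x[idx]
    c = gaussian_kernel(x, landmarks)          # (n, m)
    a = gaussian_kernel(landmarks, landmarks)  # (m, m)
    # Multiply right-to-left so the (n, n) attention matrix is never formed.
    return c @ (newton_pinv(a, iters) @ (c.T @ values))
```

With `m` held fixed, the sketch evaluates the kernel for only `n * m + m * m` token pairs instead of `n * n`, which is where the linear scaling in sequence length comes from.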