Vision Transformers (ViTs) have shown impressive performance but still incur a high computation cost compared to convolutional neural networks (CNNs). One reason is that ViTs' attention measures global similarities and thus has quadratic complexity in the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifices ViTs' capability to capture either global or local context. In this work, we ask an important research question: can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs with both linear-angular attention and masked softmax-based quadratic attention, but switches to only the linear-angular attention during inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles, and we further simplify it with two techniques: (1) a novel linear-angular attention mechanism, which decomposes the angular kernels into linear terms and high-order residuals and keeps only the linear terms; and (2) two parameterized modules that approximate the high-order residuals: a depthwise convolution and an auxiliary masked softmax attention that helps learn both global and local information, where the masks for the softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to 1.8% higher accuracy or a 40% MACs reduction on ImageNet classification, and 1.2 higher mAP on COCO detection under comparable FLOPs, compared to ViTs with vanilla softmax-based attention.
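To illustrate the core efficiency idea behind keeping only the linear term of the angular kernel, the sketch below assumes a simplified similarity of the form sim(q, k) = 1 + q̂·k̂ (a constant plus the cosine of the spectral angle between unit-normalized query and key); the paper's exact decomposition, constants, and residual modules differ. Because this similarity is linear in q̂ and k̂, associativity lets attention be computed as Q(KᵀV) in O(N·d²) instead of (QKᵀ)V in O(N²·d). This is a minimal NumPy sketch, not the authors' implementation:

```python
import numpy as np

def linear_angular_attention(Q, K, V):
    """Attention with the hypothetical linear kernel sim(q, k) = 1 + cos(angle(q, k)),
    computed in time linear in the number of tokens N."""
    # Unit-normalize so dot products equal cosines of spectral angles.
    Qh = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kh = K / np.linalg.norm(K, axis=-1, keepdims=True)
    n = K.shape[0]
    # Numerator: sum_j (1 + q_i.k_j) v_j = sum_j v_j + Qh @ (Kh.T @ V).
    # Reassociating Kh.T @ V first gives a (d, d) matrix, avoiding the (N, N) map.
    num = V.sum(axis=0) + Qh @ (Kh.T @ V)
    # Denominator: sum_j (1 + q_i.k_j) = n + Qh @ sum_j k_j.
    den = n + Qh @ Kh.sum(axis=0)
    return num / den[:, None]

# Usage: agrees with the explicit quadratic form of the same kernel.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
out = linear_angular_attention(Q, K, V)
```

The high-order residuals dropped from the kernel expansion are what the depthwise convolution and the auxiliary masked softmax branch are trained to compensate for; at inference only the linear path above remains.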