Vision Transformers (ViTs) have shown impressive performance but still incur a higher computation cost than convolutional neural networks (CNNs). One reason is that ViTs' attention measures global similarities and thus has complexity quadratic in the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifice ViTs' ability to capture either global or local context. In this work, we ask an important research question: can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but switches to linear-angular attention alone during inference. Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. We further simplify it with two techniques: (1) a novel linear-angular attention mechanism, in which we decompose the angular kernels into linear terms and high-order residuals and keep only the linear terms; and (2) two parameterized modules that approximate the high-order residuals, namely a depthwise convolution and an auxiliary masked softmax attention, which help learn both global and local information. The masks for the softmax attention are regularized to gradually become zeros and thus incur no overhead during inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to 1.8% higher accuracy or a 40% reduction in MACs on ImageNet classification and 1.2 higher mAP on COCO detection under comparable FLOPs, compared to ViTs with vanilla softmax-based attention.
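To make the linear-angular idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the angular kernel takes the form 1 - θ/π, where θ is the angle between a query and a key; the abstract does not state this form. Under that assumption, the Taylor series of arccos gives 1 - θ/π = 1/2 + cos(θ)/π + higher-order terms, and keeping only the constant and linear terms lets the key-value product be computed once and shared by all queries. The function names, the `eps` parameter, and the omission of row normalization, the depthwise convolution, and the auxiliary masked softmax branch are all simplifications made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-6):
    # Scale each row to (approximately) unit length so dot products are cosines.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def linear_angular_attention(q, k, v):
    """Linear-time attention using only the linear part of an angular kernel.

    Assumed kernel (illustrative): sim(q, k) ~= 1/2 + cos(theta) / pi.
    q, k: (n, d) arrays; v: (n, d_v) array. Cost is O(n * d * d_v).
    """
    q_hat, k_hat = l2_normalize(q), l2_normalize(k)
    # Constant term: every query receives half the sum of all values.
    const = 0.5 * v.sum(axis=0, keepdims=True)  # (1, d_v)
    # Linear term: compute K^T V once, then apply it to every query.
    kv = k_hat.T @ v                            # (d, d_v)
    return const + (q_hat @ kv) / np.pi         # (n, d_v)


def quadratic_reference(q, k, v):
    """Same similarity, but materializing the full (n, n) matrix."""
    q_hat, k_hat = l2_normalize(q), l2_normalize(k)
    sim = 0.5 + (q_hat @ k_hat.T) / np.pi       # (n, n)
    return sim @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 3))
out_linear = linear_angular_attention(q, k, v)
out_quadratic = quadratic_reference(q, k, v)
```

Both functions compute the same output; the linear version simply reorders the matrix products so that no n-by-n similarity matrix is ever formed, which is where the inference-time savings described above would come from.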