Recently, Transformers have emerged as the go-to architecture for both vision and language modeling tasks, but their computational efficiency is limited by the length of the input sequence. To address this, several efficient Transformer variants have been proposed to accelerate computation or reduce memory consumption while preserving performance. This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation. Unlike current Transformers, CageViT uses a new encoder to handle the rearranged tokens and makes several technical contributions: 1) Convolutional activation is used to pre-process the tokens after patchifying the image, selecting and rearranging the major and minor tokens, which substantially reduces the computation cost through an additional fusion layer. 2) Instead of using the class activation map of the convolutional model directly, we design a new weighted class activation to lower the model requirements. 3) To facilitate communication between major tokens and fusion tokens, Gated Linear SRA is proposed to further integrate fusion tokens into the attention mechanism. We perform a comprehensive validation of CageViT on the image classification task. Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency, while maintaining a comparable level of accuracy (e.g., a moderate-sized model with 43.35M parameters trained solely on 224 × 224 ImageNet-1K achieves a Top-1 accuracy of 83.4%).
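To make the token-reduction idea in contribution 1) concrete, the following is a minimal sketch and not the CageViT implementation. It assumes each patch token already has a non-negative activation score, keeps the highest-scoring tokens as major tokens, and fuses the remaining minor tokens into a single fusion token. The function names `split_and_fuse` and `attention_pair_ratio`, the score-weighted average used for fusion, and the choice of a single fusion token are all illustrative assumptions; the abstract does not specify how the fusion layer works.

```python
from typing import List, Sequence, Tuple


def split_and_fuse(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    num_major: int,
) -> Tuple[List[List[float]], List[float]]:
    """Keep the num_major highest-scoring tokens; fuse the rest into one token.

    Assumptions (illustrative only): scores are non-negative, and fusion is a
    score-weighted average of the minor tokens.
    """
    if not 0 < num_major < len(tokens):
        raise ValueError("num_major must be between 1 and len(tokens) - 1")

    # Rank token indices by activation score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    major_idx = sorted(order[:num_major])  # major tokens keep their original order
    minor_idx = order[num_major:]
    major = [list(tokens[i]) for i in major_idx]

    # Weight each minor token by its share of the total minor score
    # (fall back to a uniform average if every minor score is zero).
    total = sum(scores[i] for i in minor_idx)
    weights = [
        scores[i] / total if total > 0 else 1.0 / len(minor_idx)
        for i in minor_idx
    ]
    dim = len(tokens[0])
    fused = [
        sum(w * tokens[i][d] for w, i in zip(weights, minor_idx))
        for d in range(dim)
    ]
    return major, fused


def attention_pair_ratio(num_tokens: int, num_major: int) -> float:
    """Ratio of pairwise attention interactions after vs. before reduction."""
    reduced = num_major + 1  # major tokens plus one fusion token (assumed)
    return (reduced * reduced) / (num_tokens * num_tokens)
```

Under these assumptions, a 224 × 224 image split into 16 × 16 patches (a patch size chosen here for illustration) yields 196 tokens. Keeping 48 major tokens plus one fusion token gives 49 tokens, so `attention_pair_ratio(196, 48)` evaluates to (49/196)², about 6% of the original number of pairwise attention interactions.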