Recent advances in vision transformers (ViTs) have achieved strong performance in visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive biases to learn visual representations, but they are spatially local. ViTs can learn global representations through their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to reduce the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel light-weight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone to learn both global and local representations. Experimental results show that XFormer outperforms numerous CNN-based and ViT-based models across different tasks and datasets. On the ImageNet1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based), respectively, with a similar number of parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 and an FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks.