Vision Transformers (ViTs) mark a revolutionary advance in neural networks thanks to the powerful global context modeling capability of their token mixers. However, the pairwise token affinity and complex matrix operations limit their deployment in resource-constrained scenarios and real-time applications, such as mobile devices, despite the considerable efforts made in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as those across the spatial and channel domains. Following this paradigm, we construct a novel additive similarity function and present an efficient implementation named Convolutional Additive Token Mixer (CATM). This simplification leads to a significant reduction in computational overhead. We evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our experiments, conducted on GPUs, ONNX, and iPhones, demonstrate that CAS-ViT achieves competitive performance compared to other state-of-the-art backbones, establishing it as a viable option for efficient mobile vision applications. Our code and model are available at: \url{https://github.com/Tianfang-Zhang/CAS-ViT}
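To make the efficiency argument concrete, the toy NumPy sketch below contrasts standard pairwise self-attention, whose affinity matrix costs $O(N^2)$ in the number of tokens $N$, with a generic additive token mixer in which spatial and channel context maps are summed and used to gate the values, keeping the cost $O(N)$. This is not the paper's CATM: the function names, the mean-pooling choices, and the sigmoid gates here are all illustrative assumptions, not the actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_attention(q, k, v):
    # Standard self-attention: builds an explicit (N, N) token-affinity
    # matrix, so both memory and compute scale quadratically in N.
    scores = q @ k.T / np.sqrt(q.shape[1])                 # (N, N)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)          # softmax rows
    return weights @ v                                     # (N, C)

def additive_token_mixer(q, k, v):
    # Toy additive mixer (illustrative, not the paper's CATM):
    # per-token spatial context and per-channel context are computed
    # independently, summed by broadcasting, and used to gate V.
    # No N x N matrix is ever formed, so the cost stays linear in N.
    spatial_ctx = sigmoid(q.mean(axis=1, keepdims=True))   # (N, 1)
    channel_ctx = sigmoid(k.mean(axis=0, keepdims=True))   # (1, C)
    context = spatial_ctx + channel_ctx                    # additive similarity, (N, C)
    return context * v                                     # gated values, (N, C)

rng = np.random.default_rng(0)
N, C = 196, 64  # e.g. a 14x14 token grid with 64 channels
q, k, v = (rng.standard_normal((N, C)) for _ in range(3))

out_attn = pairwise_attention(q, k, v)
out_add = additive_token_mixer(q, k, v)
print(out_attn.shape, out_add.shape)  # -> (196, 64) (196, 64)
```

Both mixers map $(N, C)$ tokens to $(N, C)$ outputs, but only the pairwise version materializes the quadratic affinity matrix, which is what the additive formulation is designed to avoid on mobile hardware.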