Vision Transformers (ViTs) have proven effective across a range of vision tasks. However, scaling them down to a mobile-friendly size leads to significant performance degradation, so developing lightweight vision transformers has become an important area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between the globally shared weights typically used in vanilla convolutional operators and the token-specific, context-aware weights that appear in attention, and then proposes a simple yet effective module for capturing high-frequency local information. Specifically, we introduce AttnConv, an attention-style convolution operator. AttnConv uses shared weights to aggregate local information and applies carefully designed context-aware weights to enhance local features. Combining AttnConv with vanilla attention, which uses pooling to reduce FLOPs, enables CloFormer to perceive both high-frequency and low-frequency information. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the superiority of CloFormer.
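To make the two-branch idea concrete, the following toy sketch works on a 1D sequence of scalar tokens in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The function names, the kernels, the use of `tanh(q * k)` as the context-aware weight, and the way the two branches are paired at the end are all hypothetical choices made for this example. The local branch aggregates neighbors with one kernel shared by every position and then rescales each token with its own gate. The global branch attends over average-pooled tokens, so each query compares against fewer keys.

```python
import math


def local_aggregate(x, kernel):
    """1D convolution with zero padding: the same kernel (shared weights) at every position."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            p = i + j - radius
            if 0 <= p < len(x):
                acc += w * x[p]
        out.append(acc)
    return out


def attn_conv_sketch(x, kq, kk, kv):
    """Local branch (assumed form): shared-weight aggregation, then a per-token gate."""
    q = local_aggregate(x, kq)
    k = local_aggregate(x, kk)
    v = local_aggregate(x, kv)
    # Assumption: the context-aware weight is tanh(q * k), a token-specific value in [-1, 1].
    gates = [math.tanh(qi * ki) for qi, ki in zip(q, k)]
    return [g * vi for g, vi in zip(gates, v)]


def pooled_attention(x, pool):
    """Global branch (assumed form): softmax attention over average-pooled tokens."""
    pooled = [sum(x[i:i + pool]) / len(x[i:i + pool]) for i in range(0, len(x), pool)]
    out = []
    for xi in x:
        scores = [xi * p for p in pooled]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append(sum(e / total * p for e, p in zip(exps, pooled)))
    return out


def two_branch_sketch(x, kq, kk, kv, pool):
    """Pair the local and global outputs per token (the merge rule is an assumption)."""
    local = attn_conv_sketch(x, kq, kk, kv)
    glob = pooled_attention(x, pool)
    return list(zip(local, glob))


# Hypothetical input: an alternating (high-frequency) pattern.
tokens = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
features = two_branch_sketch(
    tokens, kq=[-1.0, 2.0, -1.0], kk=[0.0, 1.0, 0.0], kv=[0.25, 0.5, 0.25], pool=2
)
```

With `pool=2`, each query in the global branch scores against 3 pooled tokens instead of 6, which is the kind of saving the pooling is meant to provide. The local branch keeps the full resolution, so it is the one that can respond to the alternating pattern.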