Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, shrinking them to a mobile-friendly size leads to significant performance degradation, so developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between the globally shared weights typically used in vanilla convolutional operators and the token-specific, context-aware weights that appear in attention, and then proposes an effective and straightforward module to capture high-frequency local information. In CloFormer, we introduce AttnConv, an attention-style convolution operator. AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features. Combining AttnConv with vanilla attention, which uses pooling to reduce FLOPs, enables CloFormer to perceive both high-frequency and low-frequency information. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the superiority of CloFormer.
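The two-branch idea described above can be illustrated with a minimal one-dimensional sketch. It is not the CloFormer implementation: the function names are hypothetical, tokens are single scalars, the query/key/value projections are replaced by the identity, the tanh gate is only an assumed form of the context-aware weights, and the branches are fused by a plain sum. It is meant only to contrast globally shared weights (one kernel reused at every position) with token-specific weights (a gate computed from each token's content), alongside attention over pooled keys and values.

```python
import math

def shared_local_aggregate(x, kernel):
    """Slide one odd-length kernel over every position (zero padding).
    The same weights are reused for all tokens: globally shared weights."""
    r = len(kernel) // 2
    padded = [0.0] * r + list(x) + [0.0] * r
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(x))]

def attn_conv_1d(x, kernel):
    """Local branch (assumed form): shared-weight aggregation, then a
    content-dependent gate applied to the aggregated values."""
    # Assumption: q, k, v are the input itself (no learned projections).
    q = shared_local_aggregate(x, kernel)
    k = shared_local_aggregate(x, kernel)
    v = shared_local_aggregate(x, kernel)
    # Context-aware weights: one gate per token, bounded to [-1, 1] by tanh.
    gates = [math.tanh(qi * ki) for qi, ki in zip(q, k)]
    return [g * vi for g, vi in zip(gates, v)]

def pooled_attention_1d(x, pool=2):
    """Global branch: softmax attention where keys/values are average-pooled,
    so each query attends to about len(x) / pool tokens instead of len(x)."""
    pooled = [sum(x[i:i + pool]) / len(x[i:i + pool])
              for i in range(0, len(x), pool)]
    out = []
    for q in x:
        scores = [q * k for k in pooled]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum(e / z * v for e, v in zip(exps, pooled)))
    return out

x = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 4.0, 4.0]
kernel = [0.25, 0.5, 0.25]
local_out = attn_conv_1d(x, kernel)          # high-frequency, local detail
global_out = pooled_attention_1d(x, pool=2)  # low-frequency, global context
# Assumption: fuse the two branches by summation.
fused = [a + b for a, b in zip(local_out, global_out)]
```

In this sketch the cost saving from pooling is visible directly: with `pool=2`, each of the 8 queries scores 4 pooled tokens rather than 8, halving the number of query-key products.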