Vision Transformers (ViTs) have proven effective across a range of vision tasks. However, scaling them down to mobile-friendly sizes causes significant performance degradation, which makes lightweight vision transformers an important research direction. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer examines the relationship between the globally shared weights commonly used in vanilla convolutional operators and the token-specific, context-aware weights that appear in attention, and then proposes a simple yet effective module to capture high-frequency local information. Specifically, we introduce AttnConv, an attention-style convolution operator. AttnConv uses shared weights to aggregate local information and applies carefully designed context-aware weights to enhance local features. Combining AttnConv with vanilla attention, which uses pooling to reduce FLOPs, enables CloFormer to perceive both high-frequency and low-frequency information. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the superiority of CloFormer. The code is available at \url{https://github.com/qhfan/CloFormer}.
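The abstract describes AttnConv only at a high level: shared weights aggregate local information, and context-aware weights then enhance the local features. The toy sketch below illustrates that two-step idea in plain Python on a 1D, single-channel sequence. It is not the released implementation: the function names `shared_conv` and `attn_conv_sketch`, the use of a `tanh` gate over the query-key product, and the 1D setting are all assumptions made for illustration.

```python
import math


def shared_conv(xs, kernel):
    """Apply one kernel (identical weights at every position) with zero padding."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(xs)):
        acc = 0.0
        for j, w in enumerate(kernel):
            p = i + j - radius
            if 0 <= p < len(xs):
                acc += w * xs[p]
        out.append(acc)
    return out


def attn_conv_sketch(q, k, v, kernel):
    """Toy AttnConv-style operator (hypothetical, for illustration only)."""
    # Step 1: globally shared weights aggregate local information.
    q_loc = shared_conv(q, kernel)
    k_loc = shared_conv(k, kernel)
    v_loc = shared_conv(v, kernel)
    # Step 2: token-specific, context-aware weights. Each position gets its
    # own gate in [-1, 1], computed from its local query and key (assumed form).
    gates = [math.tanh(a * b) for a, b in zip(q_loc, k_loc)]
    # Step 3: the gates rescale (enhance or suppress) the local features.
    return [g * x for g, x in zip(gates, v_loc)]


if __name__ == "__main__":
    q = [1.0, 0.0, 0.0]
    k = [1.0, 1.0, 1.0]
    v = [2.0, 2.0, 2.0]
    print(attn_conv_sketch(q, k, v, kernel=[0.0, 1.0, 0.0]))
```

The contrast with a vanilla convolution is that the final weighting depends on the input at each position, whereas `shared_conv` alone applies the same kernel everywhere. In the model described above, such a local branch would be paired with a pooled attention branch that handles low-frequency information; that branch is omitted here.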