Scale-Aware Modulation Meet Transformer

This paper presents a new vision Transformer, Scale-Aware Modulation Transformer (SMT), that can handle various downstream tasks efficiently by combining convolutional networks and vision Transformers. The proposed Scale-Aware Modulation (SAM) in SMT includes two primary novel designs. First, we introduce the Multi-Head Mixed Convolution (MHMC) module, which can capture multi-scale features and expand the receptive field. Second, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective, enabling information fusion across different heads. Together, these two modules further enhance convolutional modulation. Furthermore, in contrast to prior works that use modulation throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which can effectively simulate the shift from capturing local to global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M parameters / 2.4 GFLOPs and 32M parameters / 7.7 GFLOPs achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After pretraining on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when fine-tuned at resolutions of 224^2 and 384^2, respectively. For object detection with Mask R-CNN, the SMT base model trained with the 1x and 3x schedules outperforms its Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base model tested at single and multiple scales surpasses Swin by 2.0 and 1.1 mIoU, respectively, on ADE20K.
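The idea of mixing convolutions of different scales across heads and then using the fused result to modulate features can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it works on 1-D signals in plain Python, uses fixed averaging kernels in place of learned depthwise convolutions, uses a uniform average in place of the learned cross-head fusion, and takes the raw input as the value being modulated. All function names (`depthwise_conv1d`, `mixed_conv_heads`, `scale_aware_modulation`) and the kernel sizes are hypothetical.

```python
from typing import List

Channel = List[float]  # one channel of a 1-D signal


def depthwise_conv1d(x: Channel, kernel: List[float]) -> Channel:
    """Zero-padded, same-length 1-D convolution applied to a single channel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + x + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def mixed_conv_heads(channels: List[Channel],
                     kernel_sizes: List[int]) -> List[List[Channel]]:
    """Split channels evenly into heads; head h filters with kernel_sizes[h].

    Assumption: averaging kernels stand in for learned depthwise weights.
    """
    if len(channels) % len(kernel_sizes) != 0:
        raise ValueError("channel count must be divisible by the head count")
    per_head = len(channels) // len(kernel_sizes)
    heads = []
    for h, size in enumerate(kernel_sizes):
        kernel = [1.0 / size] * size
        group = channels[h * per_head:(h + 1) * per_head]
        heads.append([depthwise_conv1d(c, kernel) for c in group])
    return heads


def scale_aware_modulation(channels: List[Channel],
                           kernel_sizes: List[int]) -> List[Channel]:
    """Fuse multi-scale head outputs, then modulate the input elementwise."""
    heads = mixed_conv_heads(channels, kernel_sizes)
    flat = [c for head in heads for c in head]
    length = len(flat[0])
    # Assumption: a uniform average replaces the learned cross-head fusion.
    modulator = [sum(c[i] for c in flat) / len(flat) for i in range(length)]
    # Modulation: the fused multi-scale response gates each input channel.
    return [[m * v for m, v in zip(modulator, c)] for c in channels]


# Two channels, two heads: kernel size 1 (local) and kernel size 3 (wider).
example = scale_aware_modulation(
    [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]], kernel_sizes=[1, 3]
)
print(example)
```

Each head sees the same spatial extent but a different receptive field, so the modulator blends responses at several scales before it multiplies the value features. In a real network the kernels, the fusion weights, and the value projection would all be learned.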

