This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers. CNNs and Transformers have been successful in a variety of tasks, and a growing number of Transformer-based works achieve state-of-the-art performance in the computer vision community. Researchers have therefore started to explore the mechanisms behind these architectures. Large receptive fields, sparse connections, weight sharing, and dynamic weights have been considered keys to designing effective base models. However, some issues remain: large dense kernels and self-attention are inefficient, and large receptive fields make it hard to capture local features. Motivated by these analyses, we design a general module that incorporates these design keys to enhance both CNNs and Transformers. SCSC introduces an efficient spatial cross-scale encoder and a spatial embed module to capture assorted features in one layer. On the face recognition task, FaceResNet with SCSC improves by 2.7% while using 68% fewer FLOPs and 79% fewer parameters. On the ImageNet classification task, Swin Transformer with SCSC achieves even better performance with 22% fewer FLOPs, and ResNet with SCSC improves by 5.3% at similar complexity. Furthermore, a traditional network (e.g., ResNet) embedded with SCSC can match Swin Transformer's performance.
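To make the efficiency argument concrete, the following is a minimal arithmetic sketch of why a large dense kernel is costly compared with a multi-scale layout. It is not the SCSC design, which the abstract does not specify. It assumes a hypothetical layer in which the channels are split evenly into groups, each group uses a depthwise convolution with a different kernel size, and a 1x1 pointwise convolution mixes the groups afterwards. The function names, the channel count of 64, and the kernel sizes (1, 3, 5, 7) are all illustrative assumptions.

```python
def dense_conv_params(c_in, c_out, k):
    """Weight count of a standard (dense) k x k convolution, bias ignored."""
    return c_in * c_out * k * k


def multi_scale_depthwise_params(channels, kernel_sizes):
    """Weight count of a hypothetical multi-scale layer (an assumption, not SCSC).

    Channels are split evenly into one group per kernel size. Each group
    applies a depthwise convolution (one k x k filter per channel), and a
    1 x 1 pointwise convolution then mixes all channels.
    """
    if channels % len(kernel_sizes) != 0:
        raise ValueError("channels must divide evenly across kernel sizes")
    group = channels // len(kernel_sizes)
    depthwise = sum(group * k * k for k in kernel_sizes)
    pointwise = channels * channels  # 1 x 1 convolution across all channels
    return depthwise + pointwise


# Illustrative values only: 64 channels, largest kernel 7 x 7.
dense = dense_conv_params(64, 64, 7)
multi = multi_scale_depthwise_params(64, (1, 3, 5, 7))
print(dense, multi)
```

Under these assumptions the dense 7x7 layer needs 200,704 weights, while the multi-scale layout needs 5,440 and still includes both small and large kernels in a single layer. The figures illustrate the general trade-off only and say nothing about the FLOP or parameter reductions reported above.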