Convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable success in a wide range of vision tasks. However, many architectures do not model interactions between feature maps from different stages and scales, which may limit their performance. In this work, we propose a simple add-on attention module that addresses this limitation through multi-stage and cross-scale interactions. Specifically, the proposed Multi-Stage Cross-Scale Attention (\meth) module takes feature maps from different stages as input to enable multi-stage interactions, and it achieves cross-scale interactions by computing self-attention at different scales on these multi-stage feature maps. Our experiments on several downstream tasks show that \meth~provides a significant performance boost with modest additional FLOPs and runtime.
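As a rough illustration of the two interactions described above, the following NumPy sketch first fuses feature maps from several stages at a common resolution (multi-stage interaction) and then lets every fused token attend to keys and values pooled at several scales (cross-scale interaction). It is a minimal sketch under stated assumptions, not the module proposed in this work: the function names, the average-pooling fusion, the pooling factors in `scales`, the omission of learned query/key/value projections, and the residual connection are all illustrative choices made here.

```python
import numpy as np


def avg_pool(x, k):
    """Non-overlapping k x k average pooling of a (C, H, W) array.

    Assumes H and W are divisible by k.
    """
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_stage_cross_scale_attention(stage_maps, scales=(1, 2)):
    """Hypothetical sketch of multi-stage, cross-scale attention.

    stage_maps: list of (C_i, H_i, W_i) arrays, one per backbone stage.
    scales:     pooling factors used to build keys/values (an assumption).
    """
    # Multi-stage interaction: pool every stage down to the coarsest
    # resolution and concatenate along the channel axis.
    # (Assumes the same downsampling factor for height and width.)
    target_h = min(m.shape[1] for m in stage_maps)
    fused = np.concatenate(
        [avg_pool(m, m.shape[1] // target_h) for m in stage_maps], axis=0
    )
    c, h, w = fused.shape

    # Queries: one token per spatial position of the fused map, shape (N, C).
    queries = fused.reshape(c, h * w).T

    # Cross-scale interaction: keys/values are tokens of the fused map
    # pooled at each scale, stacked into one sequence of shape (M, C).
    # Learned projections are omitted to keep the sketch short.
    kv = np.concatenate(
        [avg_pool(fused, s).reshape(c, -1).T for s in scales], axis=0
    )

    # Scaled dot-product attention over tokens from all scales.
    attn = softmax(queries @ kv.T / np.sqrt(c), axis=-1)  # (N, M)
    out = attn @ kv                                        # (N, C)

    # Residual connection back onto the fused feature map.
    return fused + out.T.reshape(c, h, w)


# Usage with three made-up stages whose resolution halves at each stage.
rng = np.random.default_rng(0)
stages = [
    rng.standard_normal((8, 16, 16)),
    rng.standard_normal((16, 8, 8)),
    rng.standard_normal((32, 4, 4)),
]
result = multi_stage_cross_scale_attention(stages)
print(result.shape)  # (56, 4, 4)
```

In this sketch the attention cost grows with the number of key/value tokens contributed by each scale, so adding a coarser scale is cheap while adding a finer one is expensive; how the actual module balances this trade-off is not stated in the text above.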