Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains. However, these advancements are not solely attributable to novel feature transformation designs; certain benefits also arise from advanced network-level and block-level architectures. This paper aims to identify the real gains of popular convolution and attention operators through a detailed study. We find that the key difference among feature transformation modules such as attention and convolution lies in how they spatially aggregate features, a component known as the "spatial token mixer" (STM). To enable an impartial comparison, we introduce a unified architecture that neutralizes the impact of divergent network-level and block-level designs, and integrate various STMs into this framework for comprehensive comparative analysis. Our experiments on various tasks and an analysis of inductive bias show that advanced network-level and block-level designs account for a significant share of the performance boost, yet performance differences among STMs persist. Our detailed analysis further characterizes the STMs in terms of their effective receptive fields, invariance to input transformations, and adversarial robustness.
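The "unified architecture" idea above can be sketched in miniature: keep the block structure fixed (normalize, mix tokens, add a residual) and swap only the spatial token mixer. The sketch below is purely illustrative and hypothetical; the names (`unified_block`, `avg_pool_mixer`, `global_mean_mixer`) and the toy 1-D mixers are my own stand-ins, not the paper's actual modules, which would be learned convolution- and attention-based operators.

```python
from typing import Callable, List

# A "token sequence": a list of per-token feature vectors (all the same length).
Tokens = List[List[float]]

def layer_norm(x: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token's features to zero mean, unit variance."""
    out = []
    for tok in x:
        m = sum(tok) / len(tok)
        var = sum((v - m) ** 2 for v in tok) / len(tok)
        s = (var + eps) ** 0.5
        out.append([(v - m) / s for v in tok])
    return out

def avg_pool_mixer(x: Tokens) -> Tokens:
    """Convolution-like STM caricature: each token averages a local 1-D
    neighborhood (itself and its immediate neighbors)."""
    n = len(x)
    out = []
    for i in range(n):
        nbrs = [x[j] for j in (i - 1, i, i + 1) if 0 <= j < n]
        out.append([sum(vals) / len(nbrs) for vals in zip(*nbrs)])
    return out

def global_mean_mixer(x: Tokens) -> Tokens:
    """Attention-like STM caricature: every token aggregates information
    from every other token (here, a uniform global average)."""
    pooled = [sum(vals) / len(x) for vals in zip(*x)]
    return [pooled[:] for _ in x]

def unified_block(x: Tokens, stm: Callable[[Tokens], Tokens]) -> Tokens:
    """Fixed block skeleton y = x + STM(LN(x)); only `stm` varies
    between the backbones being compared."""
    mixed = stm(layer_norm(x))
    return [[a + b for a, b in zip(tx, tm)] for tx, tm in zip(x, mixed)]
```

Because the surrounding skeleton is identical, any accuracy gap between `unified_block(x, avg_pool_mixer)` and `unified_block(x, global_mean_mixer)` is attributable to the mixer itself, which is the comparison the abstract describes.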