Vision Transformers have shown great potential in computer vision tasks. Most recent works focus on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer it is equipped with. In this paper, we propose UniNeXt, an improved general architecture for vision backbones. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architectures in which these mixers were first proposed, UniNeXt consistently boosts the performance of all of them and narrows the performance gap among them. Surprisingly, UniNeXt equipped with naive local window attention even outperforms the previous state of the art. Interestingly, the ranking of these spatial token mixers also changes under UniNeXt, suggesting that an excellent spatial token mixer may be stifled by a suboptimal general architecture, which further underscores the importance of studying the general architecture of vision backbones. All models and code will be made publicly available.