This paper studies how to keep a vision backbone effective while removing the token mixers from its basic building blocks. Token mixers, such as self-attention in vision transformers (ViTs), are intended to exchange information between different spatial tokens, but they incur considerable computational cost and latency. However, directly removing them leaves the model with an incomplete structural prior and thus causes a significant accuracy drop. To this end, we first develop RepIdentityFormer, based on the re-parameterization idea, to study the token-mixer-free model architecture. We then explore an improved learning paradigm to overcome the limitations of a simple token-mixer-free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance that remains highly efficient during inference. Extensive experiments and ablation analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design. Project page: https://techmonsterwang.github.io/RIFormer/.
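The re-parameterization idea mentioned in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the training-time block applies a learnable per-channel affine transformation to the output of a layer normalization inside a residual connection; the abstract does not specify the exact formulation, so the function names (`train_time_block`, `fold_affine_into_norm`, `inference_block`) and the `scale`/`shift` parameters are hypothetical. Because an affine map applied to an affine map is again affine, the extra branch can be folded into the normalization parameters, so the inference-time block contains no separate token-mixer module.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's channel vector, then apply per-channel scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def train_time_block(x, gamma, beta, scale, shift):
    """Assumed training-time form: x + affine(LN(x)).

    `scale` and `shift` are hypothetical per-channel parameters of the
    extra branch that only exists during training.
    """
    normed = layer_norm(x, gamma, beta)
    return [xi + s * n + t for xi, n, s, t in zip(x, normed, scale, shift)]


def fold_affine_into_norm(gamma, beta, scale, shift):
    """Merge the affine branch into the normalization parameters.

    s * (gamma * x_hat + beta) + t == (s * gamma) * x_hat + (s * beta + t)
    """
    merged_gamma = [s * g for s, g in zip(scale, gamma)]
    merged_beta = [s * b + t for s, b, t in zip(scale, beta, shift)]
    return merged_gamma, merged_beta


def inference_block(x, merged_gamma, merged_beta):
    """Inference-time form: x + LN'(x), with no separate affine branch."""
    normed = layer_norm(x, merged_gamma, merged_beta)
    return [xi + n for xi, n in zip(x, normed)]
```

In this sketch, calling `fold_affine_into_norm` once after training and then using `inference_block` should reproduce the outputs of `train_time_block` up to floating-point rounding, while the inference path performs only a normalization and a residual addition.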