Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that improves performance without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs in various visual tasks, one may ask: does a training scheme designed specifically for ViTs exist that can also improve performance without increasing inference cost? Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer both questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and we perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging its experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis to show why and how the scheme works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. In addition, our training scheme can be applied to improve performance when fine-tuning ViTs. Last but not least, the proposed EWA technique can significantly improve the effectiveness of naive MoE on various small 2D visual datasets and 3D visual tasks.
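The three ingredients named above (random uniform token partition, per-iteration EWA, and the final conversion of each MoE into a single FFN) can be illustrated with a minimal NumPy sketch. The abstract does not give the EWA update rule, so the form used here, where each expert is pulled toward the mean of all experts with a hypothetical coefficient `beta`, is an assumption for illustration only. The function names, the two-layer expert layout, and the use of ReLU in place of a ViT's usual activation are likewise assumptions, not the authors' implementation.

```python
import numpy as np

def init_expert(rng, dim, hidden):
    # One expert is a plain two-layer FFN (hypothetical layout).
    return {
        "w1": rng.normal(0.0, 0.02, (dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, dim)),
        "b2": np.zeros(dim),
    }

def ffn(params, x):
    # ReLU is used here for simplicity; it stands in for the ViT activation.
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]

def moe_forward(experts, tokens, rng):
    # Random uniform partition: shuffle token indices and split them into
    # near-equal groups, one group per expert. No learned router is involved.
    perm = rng.permutation(tokens.shape[0])
    groups = np.array_split(perm, len(experts))
    out = np.empty_like(tokens)
    for expert, idx in zip(experts, groups):
        out[idx] = ffn(expert, tokens[idx])
    return out

def ewa_step(experts, beta):
    # ASSUMED update rule: move every expert toward the mean of all experts.
    # beta = 1 leaves the experts unchanged; beta = 0 makes them identical.
    mean = {k: np.mean([e[k] for e in experts], axis=0) for k in experts[0]}
    for e in experts:
        for k in e:
            e[k] = beta * e[k] + (1.0 - beta) * mean[k]

def merge_experts(experts):
    # After training: average the experts into one FFN for inference.
    return {k: np.mean([e[k] for e in experts], axis=0) for k in experts[0]}
```

In a training loop under these assumptions, `moe_forward` would be used in the forward pass, `ewa_step` would be called once after each optimizer step, and `merge_experts` would be called once at the end to recover an FFN with the original inference cost. Note that this particular update leaves the mean of the experts unchanged, so it only shrinks the spread between them.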