Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that improves performance without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs on various visual tasks, one may ask whether a training scheme designed specifically for ViTs can likewise improve performance without increasing inference cost. Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we use MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer both questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and we perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging its experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis of why and how the scheme works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. In addition, the training scheme can be applied to improve performance when fine-tuning ViTs. Last but not least, the proposed EWA technique significantly improves the effectiveness of naive MoE on various small 2D visual datasets and on 3D visual tasks.
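The three mechanisms named in the abstract (random uniform token partition, EWA after each iteration, and conversion of an MoE into a single FFN by averaging its experts) can be illustrated with a minimal sketch. The abstract does not give the exact EWA update rule, so the form used here is an assumption: each expert is interpolated toward the mean of the other experts with a hypothetical coefficient `beta`. Expert weights are represented as flat lists of floats purely for illustration, and all function names are hypothetical.

```python
import random


def random_uniform_partition(num_tokens, num_experts, rng):
    """Randomly split token indices into num_experts equally sized groups."""
    if num_tokens % num_experts != 0:
        raise ValueError("num_tokens must be divisible by num_experts")
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    size = num_tokens // num_experts
    return [indices[i * size:(i + 1) * size] for i in range(num_experts)]


def experts_weights_averaging(experts, beta):
    """Assumed EWA form (not taken from the abstract): keep a fraction
    beta of each expert and mix in the mean of the other experts."""
    n = len(experts)
    dim = len(experts[0])
    # Per-coordinate sum over all experts, reused for every expert.
    totals = [sum(e[k] for e in experts) for k in range(dim)]
    return [
        [beta * e[k] + (1.0 - beta) * (totals[k] - e[k]) / (n - 1)
         for k in range(dim)]
        for e in experts
    ]


def merge_experts_into_ffn(experts):
    """Convert an MoE into a single FFN by averaging the expert weights."""
    n = len(experts)
    return [sum(e[k] for e in experts) / n for k in range(len(experts[0]))]


if __name__ == "__main__":
    rng = random.Random(0)
    # Eight tokens split across four experts, two tokens each.
    print(random_uniform_partition(8, 4, rng))
    experts = [[1.0, 2.0], [3.0, 4.0]]
    # Called once at the end of each training iteration.
    experts = experts_weights_averaging(experts, beta=0.5)
    # Called once after training to recover a plain FFN.
    print(merge_experts_into_ffn(experts))
```

Under this assumed update, the mean of the experts is unchanged by an EWA step, so the FFN obtained at conversion time is the same whether or not a final EWA step is applied; only the spread between experts shrinks.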