Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by activating only a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design in which entire images, rather than individual patches, are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with an inference cost of only 54M FLOPs, our MoE achieves an improvement of 4.66%.
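To make the per-image routing idea concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It rests on several assumptions that the abstract does not specify: the router input is taken to be the mean of the patch embeddings, the experts are toy scaling functions rather than MLPs, the gate weights of the selected experts are renormalized to sum to one, and the super-class guidance is modeled as a cross-entropy loss that pushes the router toward the expert associated with the image's super-class. All function names (`route_image`, `moe_layer`, `router_guidance_loss`, `make_scaling_expert`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def route_image(patches, router_weights, k=1):
    """Select the top-k experts once for the whole image.

    Assumption: the router sees the mean-pooled patch embeddings.
    """
    dim = len(patches[0])
    pooled = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    gates = softmax(matvec(router_weights, pooled))
    top = sorted(range(len(gates)), key=lambda e: gates[e], reverse=True)[:k]
    return top, gates


def moe_layer(patches, router_weights, experts, k=1):
    """Send every patch of the image through the same k selected experts."""
    top, gates = route_image(patches, router_weights, k)
    norm = sum(gates[e] for e in top)  # assumption: renormalize selected gates
    out = []
    for p in patches:
        mixed = [0.0] * len(p)
        for e in top:
            y = experts[e](p)
            for i, v in enumerate(y):
                mixed[i] += (gates[e] / norm) * v
        out.append(mixed)
    return out, top, gates


def router_guidance_loss(gates, super_class):
    """Cross-entropy between router probabilities and a super-class label.

    Assumption: expert index i is associated with super-class i.
    """
    return -math.log(gates[super_class])


def make_scaling_expert(scale):
    """Toy stand-in for an expert MLP: scales its input vector."""
    return lambda p: [scale * x for x in p]


if __name__ == "__main__":
    patches = [[1.0, 0.0], [3.0, 0.0]]  # two patches, embedding size 2
    router_weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]  # three experts
    experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0)]
    out, top, gates = moe_layer(patches, router_weights, experts, k=1)
    print("selected experts:", top)
    print("guidance loss for super-class 0:", router_guidance_loss(gates, 0))
```

Because the routing decision is made once per image instead of once per patch, only the selected experts' parameters are needed for that image, which is what makes this style of routing attractive on resource-constrained devices.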