Sparse Mixture of Experts (SMoE) has been widely employed to improve the training and inference efficiency of Transformer-based foundation models, yielding promising results. However, the performance of SMoE heavily depends on hyper-parameter choices, such as the number of experts and the number of experts to activate per token (referred to as top-k), and searching over these configurations requires training many models, incurring significant computational overhead. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training. Extensive numerical results across vision, language, and vision-language tasks demonstrate that our approach achieves performance competitive with GMoE on vision and language tasks and with MoE-LLaVA on vision-language tasks, while remaining efficient by activating fewer parameters. Our code is available at https://github.com/LINs-lab/DynMoE.
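To illustrate the idea of per-token dynamic expert activation, below is a minimal sketch of a threshold-based ("top-any") gate: a token routes to every expert whose gate probability exceeds a threshold, so the number of active experts varies per token rather than being fixed by top-k. This is a hypothetical illustration, not the paper's exact gating rule; the function name, sigmoid parameterization, and fixed threshold are assumptions for the sketch.

```python
import math

def dynamic_topany_gate(scores, threshold=0.5):
    """Sketch of threshold-based dynamic gating (hypothetical illustration;
    DynMoE's actual gating and threshold parameterization may differ).

    scores    -- raw gating logits for one token, one per expert
    threshold -- gate probability above which an expert is activated
    Returns a boolean activation mask and normalized routing weights.
    """
    # Squash logits to per-expert gate probabilities in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    # Activate every expert whose gate probability clears the threshold.
    active = [g > threshold for g in gates]
    # Fallback: if no expert clears the threshold, route to the best one
    # so every token is still processed.
    if not any(active):
        active[gates.index(max(gates))] = True
    # Renormalize weights over the active experts only.
    total = sum(g for g, a in zip(gates, active) if a)
    weights = [g / total if a else 0.0 for g, a in zip(gates, active)]
    return active, weights
```

Unlike fixed top-k routing, the activation count here is data-dependent: a confident token may fire one expert while an ambiguous one fires several, which is the property that removes top-k from the hyper-parameter search.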