Mixture-of-Experts (MoE) has emerged as a promising approach to scaling up deep learning models because it significantly reduces computational cost. However, the dynamic nature of MoE routing leads to load imbalance among experts, severely impacting training efficiency. While previous research has attempted to address the load balancing challenge, existing solutions either compromise model accuracy or introduce additional system overhead. As a result, they fail to achieve fine-grained load balancing, which is crucial to optimizing training efficiency. We propose MicroMoE, a novel parallelization strategy that achieves fine-grained load balancing in MoE systems. MicroMoE attains optimal load balance in every micro-batch through efficient token scheduling across GPUs. Our experimental results demonstrate that MicroMoE improves end-to-end training throughput by up to 47.6% compared with the state-of-the-art system, and achieves optimal load balance among GPUs in nearly all cases.