Expert parallelism is essential for training Mixture-of-Experts (MoE) models efficiently: different devices host distinct experts, and each device processes different input data. However, during expert-parallel training, dynamic routing causes significant load imbalance among experts; a handful of overloaded experts stall the overall iteration and become a training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. At its core is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert's parameters across all devices and restores selected experts at expert granularity via All-to-All communication during training. This enables flexible re-layout of expert parameters during training to improve load balance. In addition, we schedule communication operations at fine granularity to minimize communication overhead, and we develop a load-balancing planner that formulates expert re-layout strategies and token routing schemes during training. Experiments on an A100 cluster show that our system achieves up to 1.69x speedup over state-of-the-art training systems. Source code is available at https://github.com/PKU-DAIR/Hetu-Galvatron/tree/laer-moe.
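The FSEP layout described above can be illustrated with a toy, single-process sketch. This is not the paper's implementation: plain Python lists stand in for device memories, and a gather loop stands in for the All-to-All restore; all names (`shard`, `restore_expert`, etc.) are illustrative assumptions.

```python
# Toy sketch of the Fully Sharded Expert Parallel (FSEP) idea:
# every expert's parameters are split evenly across all devices, and any
# device can reassemble ("restore") a full expert by gathering its shards,
# modeling the All-to-All communication step. Illustrative only.

N_DEVICES = 4
N_EXPERTS = 8
PARAMS_PER_EXPERT = 16  # divisible by N_DEVICES in this toy setup

# Each expert is a flat parameter list; values encode (expert_id, index).
experts = [[(e, i) for i in range(PARAMS_PER_EXPERT)] for e in range(N_EXPERTS)]

def shard(params, n_devices):
    """Split one expert's parameter list into n_devices contiguous shards."""
    size = len(params) // n_devices
    return [params[d * size:(d + 1) * size] for d in range(n_devices)]

# Fully sharded layout: device d holds shard d of *every* expert,
# so no single device ever stores a whole expert at rest.
device_shards = [[shard(w, N_DEVICES)[d] for w in experts]
                 for d in range(N_DEVICES)]

def restore_expert(e):
    """Gather expert e's shards from all devices (the All-to-All restore)."""
    gathered = []
    for d in range(N_DEVICES):
        gathered.extend(device_shards[d][e])
    return gathered

# Any expert can be rebuilt on demand, so which device materializes which
# expert can be re-planned each iteration to balance load.
assert restore_expert(3) == experts[3]
```

Because every device holds an equal slice of every expert, the assignment of full experts to devices is no longer fixed; the load-balancing planner can choose a fresh assignment each iteration and realize it with one All-to-All.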