Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard to hide, especially on modern bulk-transfer backends such as DeepEP. We make a simple but consequential observation: on the NVIDIA Hopper architecture, the NVLink Copy Engine can move data between intra-node GPUs without consuming any SM cycles, effectively providing a nearly free communication channel that runs in parallel with compute kernels. FEPLB turns this idle hardware into a new parallel dimension for MoE load rebalancing. Its Two-Phase Dispatch first routes tokens across nodes via the standard expert-parallel (EP) backend, then redistributes dynamic-expert tokens and weights within the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler runs concurrently with static-expert computation. Because FEPLB uses only the Copy Engine and the CPU, resources that are orthogonal to those consumed by EP and pipeline parallelism (PP), it coexists with existing parallel strategies without reconfiguration. On GLM-5's MoE layers (128 experts, no auxiliary loss, up to 16 H100 GPUs), FEPLB reduces the token straggler by 51-70% and the GEMM straggler by 50-68% with no measurable EP communication overhead. Its advantage grows with the EP degree: at EP=8, it achieves a 2x lower token straggler than FasterMoE.
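To make the scheduling step concrete, the following is a minimal sketch of how a lightweight CPU scheduler could place dynamic experts on the GPUs of one NVLink domain once the static-expert loads are known. It is not the FEPLB algorithm, which the abstract does not specify: the greedy largest-first heuristic, the function names `rebalance_dynamic_experts` and `straggler_excess`, and the imbalance metric (maximum load minus mean load) are all assumptions made for illustration.

```python
import heapq


def rebalance_dynamic_experts(static_load, dynamic_tokens):
    """Greedy largest-first placement of dynamic experts (illustrative only).

    static_load:    list of token counts already pinned to each GPU
                    by its static experts.
    dynamic_tokens: dict mapping dynamic expert id -> tokens routed to it.
    Returns (assignment, final_load), where assignment maps
    expert id -> GPU index within the NVLink domain.
    """
    # Min-heap keyed by current load, so the least-loaded GPU pops first.
    heap = [(load, gpu) for gpu, load in enumerate(static_load)]
    heapq.heapify(heap)

    assignment = {}
    # Place the heaviest dynamic experts first; ties broken by expert id.
    ordered = sorted(dynamic_tokens.items(), key=lambda kv: (-kv[1], kv[0]))
    for expert, tokens in ordered:
        load, gpu = heapq.heappop(heap)
        assignment[expert] = gpu
        heapq.heappush(heap, (load + tokens, gpu))

    final_load = list(static_load)
    for expert, gpu in assignment.items():
        final_load[gpu] += dynamic_tokens[expert]
    return assignment, final_load


def straggler_excess(load):
    """Hypothetical imbalance metric: max load minus mean load."""
    return max(load) - sum(load) / len(load)


# Example: 4 GPUs in one node, 4 dynamic experts (made-up token counts).
static_load = [100, 40, 60, 20]
dynamic_tokens = {0: 50, 1: 30, 2: 20, 3: 10}
assignment, final_load = rebalance_dynamic_experts(static_load, dynamic_tokens)
```

In this toy input, the placement moves the heaviest dynamic experts onto the GPUs with the lightest static load. In a system like the one described, the resulting assignment would then drive the intra-node token and weight transfers, which the abstract attributes to the Copy Engine.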