Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier mixture-of-experts (MoE) models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, an approach that becomes unreliable in production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and prefill serving on rack-scale nodes (RSNs). Building on the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and every layer on the critical path, which requires a nontrivial co-design of plan solving and expert-replication communication to minimize exposed overhead. To this end, UltraEP reacts eagerly to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers using RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, a 1.49$\times$ improvement over the non-balancing baseline, while reducing the final inter-rank imbalance from 1.30--4.01 to 1.01--1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training on 2560 GPUs.
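To make the imbalance figures and the idea of quota-driven planning more concrete, the following is a minimal sketch, not UltraEP's actual planner. It assumes that inter-rank imbalance is measured as the maximum per-rank token load divided by the mean per-rank load (the abstract does not spell out the metric), and it models rebalancing at rank granularity only: moving `k` tokens from one rank to another stands in for replicating a hot expert on the destination rank and routing `k` tokens to the replica. The function names `imbalance`, `quota_plan`, and `apply_plan`, and the sample loads, are hypothetical.

```python
def imbalance(rank_loads):
    """Max-over-mean load ratio (assumed metric); 1.0 means perfectly balanced."""
    mean = sum(rank_loads) / len(rank_loads)
    return max(rank_loads) / mean


def quota_plan(rank_loads):
    """Toy quota-driven plan.

    Every rank gets the same quota (ceiling of the mean load). Ranks above
    the quota shed their surplus to ranks below it. Returns a list of
    (src_rank, dst_rank, tokens) moves.
    """
    n = len(rank_loads)
    quota = -(-sum(rank_loads) // n)  # integer ceiling of the mean
    surplus = [[r, load - quota] for r, load in enumerate(rank_loads) if load > quota]
    deficit = [[r, quota - load] for r, load in enumerate(rank_loads) if load < quota]

    moves = []
    i = j = 0
    # Two-pointer sweep: pair each overloaded rank with underloaded ranks
    # until its surplus is gone. With a ceiling quota, total surplus never
    # exceeds total deficit, so every surplus token finds a destination.
    while i < len(surplus) and j < len(deficit):
        k = min(surplus[i][1], deficit[j][1])
        moves.append((surplus[i][0], deficit[j][0], k))
        surplus[i][1] -= k
        deficit[j][1] -= k
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1
    return moves


def apply_plan(rank_loads, moves):
    """Return the per-rank loads after executing the planned moves."""
    new_loads = list(rank_loads)
    for src, dst, k in moves:
        new_loads[src] -= k
        new_loads[dst] += k
    return new_loads


if __name__ == "__main__":
    loads = [400, 100, 60, 80]  # hypothetical post-gating token counts per rank
    plan = quota_plan(loads)
    print(imbalance(loads), plan, imbalance(apply_plan(loads, plan)))
```

In this toy setting the plan is computed from the exact post-gating loads of the current microbatch rather than from historical statistics, which is the distinction the abstract draws between UltraEP and periodic balancers. A real planner would also have to account for per-expert granularity, replica memory limits, and the cost of transferring expert state, all of which this sketch ignores.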