Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training and rely on historical step-level statistics. These methods fail under the workload dynamics unique to RL post-training: the step-level load is stable, but the tiny batches processed in each micro-step cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load-balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update): routing information observed in the rollout stage is known in advance for the remaining stages and is used to balance their load proactively. To support reconfiguration at every micro-step, ForeMoE employs a hierarchical planner that decomposes the NP-hard load-balancing problem into tractable sub-problems, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) to overlap expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves a speedup of up to 1.45$\times$ over state-of-the-art RL post-training systems.
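The gap between step-level and micro-step-level balance can be illustrated with a toy sketch. It is not ForeMoE's hierarchical planner: it assumes per-expert token counts for each micro-step are already known (for example, recorded during rollout), and it substitutes a simple greedy longest-first heuristic with a fixed number of expert slots per GPU. The function names `gpu_loads` and `greedy_place` and all token counts are hypothetical.

```python
from typing import List


def gpu_loads(tokens: List[int], placement: List[int], num_gpus: int) -> List[int]:
    """Sum the tokens routed to each GPU under a given expert-to-GPU placement."""
    loads = [0] * num_gpus
    for expert, gpu in enumerate(placement):
        loads[gpu] += tokens[expert]
    return loads


def greedy_place(tokens: List[int], num_gpus: int) -> List[int]:
    """Greedy heuristic (an assumption, not the paper's planner): place the
    heaviest experts first, each on the least-loaded GPU that has a free slot."""
    if len(tokens) % num_gpus != 0:
        raise ValueError("this sketch assumes experts divide evenly across GPUs")
    slots = len(tokens) // num_gpus
    loads = [0] * num_gpus
    used = [0] * num_gpus
    placement = [0] * len(tokens)
    for expert in sorted(range(len(tokens)), key=lambda e: -tokens[e]):
        candidates = [g for g in range(num_gpus) if used[g] < slots]
        gpu = min(candidates, key=lambda g: loads[g])
        placement[expert] = gpu
        loads[gpu] += tokens[expert]
        used[gpu] += 1
    return placement


if __name__ == "__main__":
    num_gpus = 2
    # Hypothetical tokens per expert in two micro-steps of the same step.
    micro_steps = [[90, 70, 20, 20], [10, 30, 80, 80]]
    static = [0, 0, 1, 1]  # fixed placement: experts 0,1 on GPU 0; 2,3 on GPU 1

    # Summed over the step, every expert receives 100 tokens, so the
    # step-level view reports a perfectly balanced load.
    step_total = [sum(col) for col in zip(*micro_steps)]
    print("step-level loads:", gpu_loads(step_total, static, num_gpus))

    for tokens in micro_steps:
        planned = greedy_place(tokens, num_gpus)
        print(
            "static max load:", max(gpu_loads(tokens, static, num_gpus)),
            "| per-micro-step max load:", max(gpu_loads(tokens, planned, num_gpus)),
        )
```

In this toy input the static placement looks balanced at step granularity, yet one GPU carries 160 of the 200 tokens in each micro-step. Re-planning per micro-step lowers the peak to 110. The sketch ignores the cost of moving expert weights between GPUs, which is the part the transfer engine described above is meant to hide.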