The exponential growth in Large Language Model (LLM) parameter counts has made model training increasingly resource-intensive. With the stagnation of Moore's Law and the widening gap between computation throughput and communication bandwidth, expert parallelism (EP) has emerged as a critical strategy for scaling mixture-of-experts (MoE) models. However, despite numerous proposals for optimizing EP, ranging from communication compression to computation-communication overlap, adoption within production-grade frameworks such as Megatron-LM remains conservative. Existing solutions often rely on ad-hoc, complex kernels that do not adapt across optimization configurations, and they frequently neglect numerical stability, so they fail to meet the strict precision requirements of large-scale training. In this paper, we introduce UniEP, a system that unifies diverse EP optimization strategies under a single abstraction. UniEP fuses MoE communication and computation into MegaKernels, turning complex architectural tuning into a search over a unified parameter space that can be automated. Crucially, UniEP incorporates a deterministic token ordering mechanism that guarantees numerical consistency with sequential execution, even under aggressive overlap schedules. We evaluate UniEP on GPU clusters equipped with NVIDIA Hopper GPUs. Our results show that UniEP achieves speedups of 1.03$\times$ to 1.38$\times$ over state-of-the-art work, mitigating communication bottlenecks while maintaining the accuracy standards required for production LLM training.
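To illustrate why token ordering matters for numerical consistency, the following minimal Python sketch shows that floating-point addition is not associative, so summing expert contributions in arrival order can yield different results depending on the schedule, whereas reducing in a fixed order reproduces the sequential result. It is not UniEP's mechanism, only an illustration of the underlying issue. The function names, the `(expert_id, value)` representation, and the sample values are assumptions chosen for the example.

```python
from itertools import permutations


def combine_in_arrival_order(contribs):
    """Sum (expert_id, value) pairs in whatever order they arrive."""
    total = 0.0
    for _, value in contribs:
        total += value
    return total


def combine_deterministic(contribs):
    """Sort by expert_id first, so the reduction order is always the same."""
    total = 0.0
    for _, value in sorted(contribs, key=lambda c: c[0]):
        total += value
    return total


# Hypothetical weighted expert outputs for a single token (illustrative values
# picked so that rounding differences are visible in double precision).
contribs = [(0, 1e16), (1, 1.0), (2, -1e16), (3, 1.0)]

# Reference: sequential execution visits experts in id order.
sequential = combine_in_arrival_order(contribs)

# Enumerate every possible arrival order, as an overlap schedule might produce.
arrival_results = set()
deterministic_results = set()
for arrival in permutations(contribs):
    arrival_results.add(combine_in_arrival_order(arrival))
    deterministic_results.add(combine_deterministic(arrival))

print("distinct sums in arrival order:", len(arrival_results))
print("distinct sums with fixed order:", len(deterministic_results))
```

In this sketch, the arrival-order reduction produces more than one distinct sum across schedules, while the fixed-order reduction always matches the sequential reference bit for bit.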