Mixture-of-Experts (MoE) has become a popular architecture for scaling large models. However, the rapidly growing model scale outpaces the training capacity of a single datacenter (DC), driving a shift toward a more flexible, cross-DC training paradigm. Under this paradigm, Expert Parallelism (EP) for MoE faces significant scalability issues due to the limited cross-DC bandwidth. Specifically, existing EP optimizations attempt to overlap data communication with computation, which yields little benefit in low-bandwidth scenarios because the data communication time far exceeds the computation it could hide behind. As a result, cross-DC EP scaling is fast becoming a critical roadblock to the continued growth of MoE models. To address this, we propose HybridEP, a modeling-guided framework for optimizing EP under constrained bandwidth. Our key idea is to dynamically transform the spatial placement of experts to reduce data communication traffic and frequency, thereby minimizing EP's communication overhead. However, finding the optimal solution is non-trivial because mixing data and expert communication complicates the original communication pattern. We therefore build a stream-based model to determine the optimal transmission ratio. Guided by this model, we incorporate two techniques: (1) domain-based partition, which constructs the mapping between hybrid patterns and a specific GPU-level communication topology, and (2) parameter-efficient migration, which further refines this topology by reducing expert transmission overhead and enlarging the domain size. Combining these designs, HybridEP can be regarded as a more general form of EP with better scalability. Experimental results show that HybridEP outperforms existing state-of-the-art MoE training systems by up to 5.6x under constrained bandwidth. We further compare HybridEP and EP in large-scale simulations, where HybridEP achieves up to 1.45x speedup with 1k DCs under different bandwidths.
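To make the notion of a transmission ratio concrete, the following is a minimal, purely illustrative sketch rather than the paper's actual stream-based model: it grid-searches a ratio r that splits remote expert load between token (data) communication and expert (parameter) migration, minimizing a toy estimate of cross-DC communication time. Every name, quantity, and cost formula below is an assumption introduced for illustration only.

```python
# Illustrative sketch only: a toy cost model for choosing how much remote expert
# load to serve via data (token) communication vs. expert (parameter) migration
# under constrained cross-DC bandwidth. The formulas are assumptions for
# illustration and are NOT HybridEP's actual stream-based model.

def comm_time(r, token_bytes, expert_bytes, num_remote_experts, bandwidth_gbps):
    """Estimated cross-DC communication time (seconds) when a fraction r of
    remote expert load is served by sending tokens, and the remaining (1 - r)
    by migrating expert parameters closer to the data."""
    bw = bandwidth_gbps * 1e9 / 8                                   # bytes per second
    data_traffic = r * token_bytes                                  # all-to-all token bytes
    expert_traffic = (1 - r) * num_remote_experts * expert_bytes    # migrated parameter bytes
    return (data_traffic + expert_traffic) / bw

def best_ratio(token_bytes, expert_bytes, num_remote_experts, bandwidth_gbps, steps=100):
    """Grid-search the transmission ratio r in [0, 1] that minimizes the toy cost.
    Because this toy cost is linear in r, the optimum lands at an endpoint; a
    realistic model would also capture communication frequency, overlap with
    computation, and topology, which can move the optimum to an interior ratio."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda r: comm_time(r, token_bytes, expert_bytes,
                                       num_remote_experts, bandwidth_gbps))

if __name__ == "__main__":
    # Hypothetical numbers: 2 GB of tokens per step, 100 MB per expert,
    # 8 remote experts, 10 Gbps cross-DC link.
    r = best_ratio(token_bytes=2e9, expert_bytes=1e8,
                   num_remote_experts=8, bandwidth_gbps=10)
    print(f"toy optimal data/expert transmission ratio: {r:.2f}")
```

A full model in the spirit of the abstract would additionally account for how often tokens versus parameters must cross the DC boundary and how much of that traffic can overlap with computation; the sketch above only shows the shape of the decision, picking a ratio that minimizes modeled communication time.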