Mixture-of-experts (MoE) has been widely adopted to scale large language models to trillion-plus parameters while keeping the computational cost fixed. Developing large MoE models in distributed settings suffers from heavy communication overhead: with popular models and frameworks, the inter-device communication of a MoE layer can occupy 47% of the entire model execution time. Existing methods therefore propose pipelining the communication in a MoE layer with computation so that the two overlap. However, these coarse-grained overlapping schemes noticeably impair computational efficiency, and the resulting latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and delivers a $1.71\times$ end-to-end speedup on average. COMET has been deployed in production clusters with tens of thousands of GPUs, saving millions of GPU hours.
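To make the pipelining idea concrete, the following is a minimal sketch of communication-computation overlapping: a token batch is split into chunks, and while chunk $i$ is being computed, the "communication" for chunk $i{+}1$ runs concurrently. All names and the thread-based pipeline are illustrative assumptions for exposition; they are not COMET's actual implementation, which operates at a much finer granularity inside fused GPU kernels.

```python
# Illustrative sketch (assumption, not COMET's code): overlap a chunk's
# computation with the next chunk's communication using a background thread.
from concurrent.futures import ThreadPoolExecutor

def communicate(chunk):          # stand-in for all-to-all token dispatch
    return [t * 2 for t in chunk]

def compute(chunk):              # stand-in for expert FFN computation
    return [t + 1 for t in chunk]

def moe_layer_pipelined(tokens, n_chunks=4):
    size = (len(tokens) + n_chunks - 1) // n_chunks
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    out = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        # Prefetch communication for chunk 0; thereafter, while chunk i
        # computes, chunk i+1's communication proceeds in the background.
        fut = comm.submit(communicate, chunks[0])
        for i in range(len(chunks)):
            received = fut.result()
            if i + 1 < len(chunks):
                fut = comm.submit(communicate, chunks[i + 1])
            out.extend(compute(received))
    return out
```

At coarse granularity (few, large chunks) the pipeline leaves long bubbles at its head and tail; shrinking the chunks reduces those bubbles, which is the intuition behind the fine-grained overlapping the abstract advocates.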