The size of deep learning models has been increasing to enhance model quality. Since the training computation budget grows linearly with model size, training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Experts (MoE) architecture has drawn significant attention because it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, considerable dynamic load imbalance arises among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address this challenge. Compared to algorithm-level approaches, system-level solutions draw more attention for their hardware affinity and because they do not disrupt model convergence. However, they suffer from high communication costs and poor communication-computation overlap. To address these challenges, we propose Pro-Prophet, a systematic load-balancing method consisting of a planner and a scheduler for efficient parallel training of large-scale MoE models. To adapt to the dynamic load imbalance, we profile training statistics and use them to design Pro-Prophet. To reduce communication volume, the Pro-Prophet planner determines a series of lightweight load-balancing strategies and, based on the statistics, efficiently searches for a communication-efficient one to apply during training. To sufficiently overlap communication with computation, the Pro-Prophet scheduler schedules data-dependent operations based on the statistics and operation features, further improving training throughput. Experimental results show that Pro-Prophet achieves up to a 2.66x speedup over DeepSpeed-MoE and FasterMoE. Additionally, Pro-Prophet achieves up to an 11.01x improvement in load balance compared to FasterMoE.