FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

As data volumes grow, there is a trend toward using large-scale pre-trained models that store knowledge in an enormous number of model parameters. Training these models consists largely of dense algebraic operations and requires a huge amount of hardware resources. Recently, sparsely-gated Mixture-of-Experts (MoE) models have become more popular and have demonstrated impressive pre-training scalability across various downstream tasks. However, such sparse conditional computation may not be as effective as expected in practical systems because of routing imbalance and routing fluctuation. More generally, MoEs are becoming a new data analytics paradigm in the data life cycle and face unique challenges at scales, complexities, and granularities never before possible. In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently addresses the inefficiency caused by dynamic dataflow. We first present an empirical analysis of the problems and opportunities in training MoE models, which motivates us to overcome the routing imbalance and fluctuation problems with a dynamic expert management and device placement mechanism. We then introduce a novel scheduling module on top of the existing DNN runtime that monitors the data flow, makes scheduling plans, and dynamically adjusts the model-to-hardware mapping guided by real-time data traffic. A simple but efficient heuristic algorithm is used to dynamically optimize the device placement during training. We have conducted experiments on both NLP models (e.g., BERT and GPT) and vision models (e.g., Swin). The results show that FlexMoE achieves superior performance compared with existing systems on real-world workloads: FlexMoE outperforms DeepSpeed by 1.70x on average and up to 2.10x, and outperforms FasterMoE by 1.30x on average and up to 1.45x.
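To make the routing imbalance problem concrete, the following is a minimal sketch of one possible greedy placement heuristic. It is not FlexMoE's actual algorithm, which the abstract does not specify. The function names `imbalance_ratio` and `greedy_replica_plan`, the token counts, and the notion of a fixed number of device "slots" are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def imbalance_ratio(token_counts, replicas):
    """Max per-replica load divided by the ideal per-slot load (1.0 is perfect balance)."""
    total_slots = sum(replicas.values())
    ideal = sum(token_counts.values()) / total_slots
    worst = max(token_counts[e] / replicas[e] for e in token_counts)
    return worst / ideal


def greedy_replica_plan(token_counts, num_slots):
    """Give every expert one slot, then hand each extra slot to the expert
    whose replicas are currently the most heavily loaded (hypothetical heuristic)."""
    experts = list(token_counts)
    if num_slots < len(experts):
        raise ValueError("need at least one slot per expert")
    replicas = {e: 1 for e in experts}
    # heapq is a min-heap, so store negated per-replica load to pop the largest.
    heap = [(-token_counts[e], e) for e in experts]
    heapq.heapify(heap)
    for _ in range(num_slots - len(experts)):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-token_counts[e] / replicas[e], e))
    return replicas


# Hypothetical routing decisions for one batch: the expert id chosen for each token.
assignments = [0] * 60 + [1] * 20 + [2] * 12 + [3] * 8
counts = Counter(assignments)

static_plan = {e: 1 for e in counts}                     # one replica per expert
dynamic_plan = greedy_replica_plan(counts, num_slots=8)  # replicate hot experts

print("static imbalance: ", imbalance_ratio(counts, static_plan))
print("dynamic plan:     ", dynamic_plan)
print("dynamic imbalance:", imbalance_ratio(counts, dynamic_plan))
```

In this toy batch, the static one-replica-per-expert layout leaves the busiest device with about 2.4 times the ideal load, while the greedy plan gives the hot expert four replicas and brings the ratio down to about 1.2. Because routing fluctuates during training, a real system would need to recompute such a plan as traffic changes and weigh the cost of moving expert parameters, which this sketch ignores.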
