As the scale of models and training data continues to grow, training large-scale models relies on an ever-larger number of GPUs, which inevitably increases the likelihood of encountering dynamic stragglers, i.e., devices that occasionally lag behind in performance. However, hybrid parallel training, one of the de facto paradigms for training large models, is typically sensitive to stragglers. This paper presents Malleus, a straggler-resilient hybrid parallel training framework for large-scale models. Malleus captures dynamic straggler issues at a nuanced, per-GPU granularity during training. Once a shift in GPU capability is detected, Malleus adaptively adjusts the parallelization of GPU devices, pipeline stages, model layers, and training data through a novel planning algorithm, accommodating dynamic stragglers in real time. In addition, Malleus seamlessly and efficiently migrates the model states on the fly to fulfill the adjusted parallelization plan, without sacrificing the stability of the training tasks. Empirical results on large language models with up to 110B parameters show that Malleus consistently outperforms existing parallel training frameworks under various straggler situations, delivering 2.63-5.28 times efficiency improvements on average.
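To illustrate the detect-then-rebalance pattern the abstract describes, here is a minimal, hypothetical Python sketch. The function names (`detect_stragglers`, `rebalance_layers`) and the threshold-based detection rule are illustrative assumptions, not Malleus's actual algorithm, which jointly plans devices, stages, layers, and data.

```python
def detect_stragglers(throughputs, threshold=0.8):
    """Flag GPUs whose throughput falls below a fraction of the median.

    Hypothetical detection rule; Malleus's per-GPU monitoring is more nuanced.
    """
    median = sorted(throughputs)[len(throughputs) // 2]
    return [i for i, t in enumerate(throughputs) if t < threshold * median]


def rebalance_layers(num_layers, throughputs):
    """Reassign model layers to GPUs in proportion to current throughput.

    A simple proportional heuristic standing in for the planning algorithm.
    """
    total = sum(throughputs)
    shares = [t / total * num_layers for t in throughputs]
    counts = [int(s) for s in shares]
    # Hand leftover layers to the GPUs with the largest fractional remainders.
    leftover = num_layers - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

For example, if one of four GPUs slows to half speed, `detect_stragglers([1.0, 1.0, 0.5, 1.0])` flags GPU 2, and `rebalance_layers(32, [1.0, 1.0, 0.5, 1.0])` shifts layers away from it so all pipeline stages finish at roughly the same time.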