As the scale of models and training data continues to grow, training large-scale models relies on an ever-larger number of GPUs, which inevitably increases the likelihood of encountering dynamic stragglers, i.e., devices that occasionally lag behind in performance. However, hybrid parallel training, one of the de facto paradigms for training large models, is typically sensitive to stragglers. This paper presents Malleus, a straggler-resilient hybrid parallel training framework for large-scale models. Malleus captures dynamic straggler issues at a nuanced, per-GPU granularity during training. Once a shift in GPU capability is detected, Malleus adaptively adjusts the parallelization of GPU devices, pipeline stages, model layers, and training data through a novel planning algorithm, accommodating dynamic stragglers in real time. In addition, Malleus seamlessly and efficiently migrates the model states on the fly to fulfill the adjusted parallelization plan, without sacrificing the stability of the training tasks. Empirical results on large language models with up to 110B parameters show that Malleus consistently outperforms existing parallel training frameworks under various straggler situations, delivering 2.63-5.28 times efficiency improvements on average.
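To illustrate the detect-then-rebalance pattern the abstract describes, here is a minimal, hypothetical Python sketch. The function names (`detect_stragglers`, `rebalance_layers`) and the threshold-based detection rule are illustrative assumptions, not Malleus's actual algorithm, which jointly plans devices, stages, layers, and data.

```python
def detect_stragglers(throughputs, threshold=0.8):
    """Flag GPUs whose throughput falls below a fraction of the median.

    Hypothetical detection rule; Malleus's per-GPU monitoring is more nuanced.
    """
    median = sorted(throughputs)[len(throughputs) // 2]
    return [i for i, t in enumerate(throughputs) if t < threshold * median]


def rebalance_layers(num_layers, throughputs):
    """Reassign model layers to GPUs in proportion to current throughput.

    A simple proportional heuristic standing in for the planning algorithm.
    """
    total = sum(throughputs)
    shares = [t / total * num_layers for t in throughputs]
    counts = [int(s) for s in shares]
    # Hand leftover layers to the GPUs with the largest fractional remainders.
    leftover = num_layers - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

For example, if one of four GPUs slows to half speed, `detect_stragglers([1.0, 1.0, 0.5, 1.0])` flags GPU 2, and `rebalance_layers(32, [1.0, 1.0, 0.5, 1.0])` shifts layers away from it so all pipeline stages finish at roughly the same time.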