AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training

Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional Pipeline parallelism (AMDP) to mitigate this issue while sustaining high utilization. AMDP limits the first stage of each pipeline to process at most two minibatches before backpropagation, bounding the number of parameter updates between forward and backward passes. To alleviate the resulting pipeline bubbles, AMDP launches multiple concurrent pipelines and adapts their number according to pipeline depth. In addition, AMDP accumulates gradients across minibatches and applies them in a single update, ensuring that only a bounded number of minibatches experience parameter mismatch, limited to within one optimization step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates training while preserving convergence.
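The interaction between the two-minibatch limit and gradient accumulation can be illustrated with a toy event simulation. The sketch below is not the authors' schedule. It assumes a single first stage that forwards minibatches until `max_in_flight` are outstanding, then backpropagates the oldest one, and it assumes the weight version is bumped once every `accum_steps` backward passes. The function name `simulate_staleness` and all of its parameters are hypothetical and introduced only for illustration.

```python
def simulate_staleness(num_minibatches, accum_steps, max_in_flight=2):
    """Toy model (assumption, not the AMDP schedule): return, per minibatch,
    how many weight updates occurred between its forward and backward pass."""
    version = 0          # current weight version, bumped once per optimizer step
    fwd_version = {}     # weight version seen by each minibatch's forward pass
    staleness = {}       # updates between forward and backward, per minibatch
    pending_grads = 0    # gradients accumulated since the last update
    next_fwd = 0
    next_bwd = 0

    while next_bwd < num_minibatches:
        # Forward new minibatches until the in-flight limit is reached.
        while next_fwd < num_minibatches and next_fwd - next_bwd < max_in_flight:
            fwd_version[next_fwd] = version
            next_fwd += 1

        # Backpropagate the oldest in-flight minibatch.
        staleness[next_bwd] = version - fwd_version[next_bwd]
        next_bwd += 1
        pending_grads += 1

        # Apply the accumulated gradients in a single update.
        if pending_grads == accum_steps:
            version += 1
            pending_grads = 0

    return staleness


if __name__ == "__main__":
    for accum in (1, 4):
        s = simulate_staleness(num_minibatches=12, accum_steps=accum)
        mismatched = sum(1 for v in s.values() if v > 0)
        print(f"accum_steps={accum}: mismatched={mismatched}, "
              f"max staleness={max(s.values())}")
```

Under these assumptions, updating after every minibatch leaves nearly every minibatch with a forward and backward pass on different weights, whereas accumulating over four minibatches leaves one mismatched minibatch per update window, each stale by a single optimization step. The real system spans multiple stages and concurrent pipelines, which this toy model does not capture.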

