Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional Pipeline parallelism (AMDP) to mitigate this issue while sustaining high utilization. AMDP limits the first stage of each pipeline to process at most two minibatches before backpropagation, bounding the number of parameter updates between forward and backward passes. To alleviate the resulting pipeline bubbles, AMDP launches multiple concurrent pipelines and adapts their number according to pipeline depth. In addition, AMDP accumulates gradients across minibatches and applies them in a single update, ensuring that only a bounded number of minibatches experience parameter mismatch, limited to within one optimization step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates training while preserving convergence.
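To make the gradient-accumulation step concrete, the following is a minimal sketch in plain Python and not the authors' implementation. The function name `accumulate_and_step`, the use of plain SGD, and the averaging of the accumulated gradient over the number of minibatches are all assumptions made for illustration; the abstract only states that gradients are accumulated across minibatches and applied in a single update.

```python
def accumulate_and_step(params, minibatch_grads, lr):
    """Sum per-minibatch gradients, then apply one SGD update.

    Hypothetical helper: averaging over minibatches and plain SGD
    are illustrative assumptions, not details taken from AMDP.
    """
    accumulated = [0.0] * len(params)
    # Accumulate: parameters stay fixed while gradients are summed.
    for grads in minibatch_grads:
        for i, g in enumerate(grads):
            accumulated[i] += g
    n = len(minibatch_grads)
    # Single update: all minibatches contribute to one optimization step.
    return [p - lr * a / n for p, a in zip(params, accumulated)]


params = [1.0, -2.0]
minibatch_grads = [[0.5, 1.0], [1.5, -1.0]]  # two minibatches, toy values
new_params = accumulate_and_step(params, minibatch_grads, lr=0.1)
print(new_params)
```

Because the parameters change only once per group of minibatches, any forward pass in the group is at most one update behind its backward pass, which is the property the abstract relies on to bound parameter mismatch.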