Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $\Theta(\gamma^2 S^4)$, equivalently as $\Theta(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.
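The arithmetic behind the scaling diagnosis can be sketched as follows. This is an illustration under assumptions not stated in the abstract: that the stale-read term is proportional to $\gamma^2 \tau^2$ for a delay $\tau$ (a common form in stale-SGD analyses), that $K$ is the number of iterations, and that the tuned step size satisfies $\gamma \propto K^{-1/2}$. The theorem's exact constants and remaining terms are omitted.

$$
\tau = S^2 - \tfrac{S}{2} + O(1) = \Theta(S^2)
\quad\Longrightarrow\quad
\gamma^2 \tau^2 = \Theta(\gamma^2 S^4)
\quad\xrightarrow{\;\gamma \,\propto\, K^{-1/2}\;}\quad
\Theta\!\left(\frac{S^4}{K}\right).
$$

Under these assumptions, doubling the number of stages multiplies the stale-read term by roughly $2^4 = 16$ at a fixed iteration budget.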