Demystifying Pipeline Parallelism: First Theory for PipeDream

Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $\Theta(\gamma^2 S^4)$, equivalently as $\Theta(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.
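
The step from the delay bound to the stale-read scaling can be sketched as follows. This is a back-of-the-envelope reading rather than the paper's proof: it assumes the stale-read term in the theorem is proportional to $\gamma^2 \tau^2$, where $\tau$ is our own notation for the steady-state delay, and that the tuned step size satisfies $\gamma \propto 1/\sqrt{K}$, with $K$ taken to be the number of iterations. Neither assumption is stated in the abstract itself.

$$
\tau = S^2 - \tfrac{S}{2} + O(1)
\;\Longrightarrow\;
\tau^2 = S^4 - S^3 + O(S^2) = \Theta(S^4),
$$

$$
\gamma^2 \tau^2 = \Theta(\gamma^2 S^4),
\qquad
\gamma \propto \frac{1}{\sqrt{K}}
\;\Longrightarrow\;
\gamma^2 \tau^2 = \Theta\!\left(\frac{S^4}{K}\right).
$$

Under these assumptions, the quadratic growth of the delay in $S$ is what produces the quartic dependence on the number of stages.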

