Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework that decomposes pipeline schedules into a repeating building block, and we show that the lifespan of the building block determines the peak activation memory of the pipeline schedule. Guided by these observations, we find that almost all existing pipeline schedules, to the best of our knowledge, are memory inefficient. To address this, we introduce a family of memory-efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B without sacrificing efficiency, and even to 1/3 with comparable throughput. We can also achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B. Our evaluations demonstrate that in pure pipeline parallelism settings, our methods outperform 1F1B by 7% to 55% in terms of throughput. When employing a grid search over hybrid parallelism hyperparameters in practical scenarios, our methods achieve a 16% throughput improvement over the 1F1B baseline for large language models. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
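To make the link between a schedule and its peak activation memory concrete, the following toy sketch (not the paper's framework; the `one_f_one_b_ops` helper and its warmup convention are illustrative assumptions) enumerates the per-stage forward/backward sequence of a standard 1F1B schedule and counts in-flight microbatches. The peak in-flight count at a stage is a proxy for its peak activation memory; for 1F1B it is largest at the first stage, where it equals the number of pipeline stages.

```python
def one_f_one_b_ops(stage, num_stages, num_micro):
    """Per-stage op sequence for a standard 1F1B schedule (illustrative).

    Stage `stage` (0-indexed) runs a warmup of forwards, then alternates
    one backward with one forward (the steady state), then drains with
    backwards. Timing/communication between stages is ignored here.
    """
    warmup = num_stages - stage  # forwards issued before the first backward
    ops = ['F'] * warmup
    ops += ['B', 'F'] * (num_micro - warmup)  # steady-state 1F1B phase
    ops += ['B'] * warmup                     # cooldown: drain remaining backwards
    return ops

def peak_in_flight(ops):
    """Peak number of microbatches whose activations are held at once.

    A forward ('F') stashes one microbatch's activations; the matching
    backward ('B') releases them.
    """
    cur = peak = 0
    for op in ops:
        cur += 1 if op == 'F' else -1
        peak = max(peak, cur)
    return peak

# With 4 stages and 8 microbatches, stage 0 holds up to 4 microbatches'
# activations while the last stage holds only 1.
peaks = [peak_in_flight(one_f_one_b_ops(s, 4, 8)) for s in range(4)]
print(peaks)  # → [4, 3, 2, 1]
```

This per-stage imbalance (peak proportional to `num_stages - stage`) is exactly the kind of memory profile that the building-block view makes explicit, since the peak follows from how long each microbatch's activations live between its forward and backward passes.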