Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework that decomposes pipeline schedules into repeated building blocks, and we show that the lifespan of a building block determines the peak activation memory of the pipeline schedule. Guided by these observations, we find that almost all existing pipeline schedules are, to the best of our knowledge, memory inefficient. To address this, we introduce a family of memory-efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B's without sacrificing efficiency, and even to 1/3 with comparable throughput. We can also achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B. Our evaluations demonstrate that in pure pipeline parallelism settings, our methods outperform 1F1B by 7% to 55% in throughput. When a grid search over hybrid parallelism hyperparameters is employed in practical scenarios, our proposed methods achieve a 16% throughput improvement over the 1F1B baseline for large language models.
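To make the memory claim concrete, here is a minimal sketch (not the paper's code; the function name and schedule encodings are hypothetical) of the standard accounting behind it: on a given stage, activation memory in microbatch units is the number of forward passes whose backward has not yet run, so the peak is set by how long each microbatch's activations stay alive under the schedule.

```python
def peak_activations(ops):
    """Count peak live activations (in microbatch units) on one stage.

    ops: the sequence of 'F' (forward) and 'B' (backward) steps that
    the stage executes, in order. A forward allocates one microbatch's
    activations; the matching backward frees them.
    """
    live = peak = 0
    for op in ops:
        if op == 'F':
            live += 1
            peak = max(peak, live)
        elif op == 'B':
            live -= 1
    return peak

# Stage 0 of a 4-stage 1F1B schedule with 8 microbatches:
# 4 warmup forwards, a 1F1B steady state, then 4 cooldown backwards.
one_f1b = ['F'] * 4 + ['B', 'F'] * 4 + ['B'] * 4
# A GPipe-style schedule runs all forwards before any backward.
gpipe = ['F'] * 8 + ['B'] * 8

print(peak_activations(one_f1b))  # 4: bounded by pipeline depth
print(peak_activations(gpipe))    # 8: grows with microbatch count
```

Under this accounting, 1F1B's peak on stage 0 equals the pipeline depth, while a GPipe-style schedule scales with the number of microbatches; the building blocks proposed in the paper shorten activation lifespans further, pushing the peak below the 1F1B bound.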