Data and pipeline parallelism are ubiquitous for training Large Language Models (LLMs) on distributed nodes. Driven by the need for cost-effective training, recent work explores efficient communication arrangements for end-to-end training. Motivated by LLMs' resistance to layer skipping and layer reordering, in this paper we explore stage (several consecutive layers) skipping in pipeline training and challenge the conventional practice of sequential pipeline execution. We derive convergence and throughput constraints (guidelines) for pipelining with skipped and swapped pipeline stages. Based on these constraints, we propose SkipPipe, the first partial pipeline framework that reduces end-to-end training time for LLMs while preserving convergence. The core of SkipPipe is a path-scheduling algorithm that optimizes the paths of individual microbatches and reduces idle time (due to microbatch collisions) on the distributed nodes, while complying with a given stage-skipping ratio. We extensively evaluate SkipPipe on LLaMa models from 500M to 8B parameters on up to 20 nodes. Our results show that SkipPipe reduces training iteration time by up to $55\%$ compared to a full pipeline. Our partial pipeline training also improves resistance to layer omission during inference: perplexity drops by only $7\%$ when running only half the model. Our code is available at https://github.com/gensyn-ai/skippipe.
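The path-scheduling idea behind SkipPipe can be illustrated with a minimal greedy sketch: each microbatch is assigned a "path" (a subset of pipeline stages to execute) so that roughly the requested fraction of stages is skipped, while stages are filled evenly to reduce collisions between microbatches. All function names and the least-loaded heuristic below are our own illustration under these assumptions, not the paper's actual algorithm.

```python
def schedule_paths(num_stages, num_microbatches, skip_ratio):
    """Toy greedy path scheduler (illustrative assumption, not SkipPipe itself).

    Each microbatch keeps roughly (1 - skip_ratio) of the stages, always
    including stage 0 (the embedding-side stage, which cannot be skipped),
    and fills the rest of its path with the currently least-loaded stages
    so that no stage is oversubscribed (fewer collisions -> less idle time).
    """
    keep = max(1, round(num_stages * (1 - skip_ratio)))
    load = [0] * num_stages          # microbatches assigned to each stage so far
    paths = []
    for _ in range(num_microbatches):
        path = {0}                   # the first stage is never skipped
        # pick the remaining stages greedily, least-loaded first
        rest = sorted(range(1, num_stages), key=lambda s: (load[s], s))
        path.update(rest[:keep - 1])
        path = sorted(path)
        for s in path:
            load[s] += 1
        paths.append(path)
    return paths, load

paths, load = schedule_paths(num_stages=4, num_microbatches=8, skip_ratio=0.25)
# Every path executes 3 of the 4 stages; stage 0 sees all 8 microbatches,
# and the skippable stages end up balanced within one microbatch of each other.
```

A real scheduler would additionally account for per-stage compute and communication times when choosing paths, but the load-balancing intuition is the same: spreading microbatches across skippable stages keeps nodes busy instead of idle.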