The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLM training sequence lengths extend to 32k or even 128k tokens, current pipeline-parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, which greatly hinder model scalability and training throughput. To improve memory efficiency and training throughput, we introduce Seq1F1B, an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing both bubble size and memory footprint. Because splitting sequences evenly may introduce slight extra bubbles, we design a computation-wise strategy for partitioning input sequences that mitigates this side effect. Compared with competitive pipeline baselines such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with a smaller memory footprint. Notably, Seq1F1B efficiently trains an LLM with 30B parameters on sequences of up to 64k tokens using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM and is now available at: https://github.com/MayDomine/Seq1F1B.git.
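To illustrate why an even split can be unbalanced, the following is a minimal sketch of a computation-wise partition. It is not the authors' implementation. It assumes a simplified, hypothetical cost model in which the token at position i costs `linear_cost + attn_cost * (i + 1)`, reflecting that under causal attention a token attends to all earlier tokens and itself. The function name `balanced_split` and both cost parameters are illustrative assumptions.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_split(seq_len, num_segments, linear_cost=1.0, attn_cost=1.0):
    """Split [0, seq_len) into contiguous segments of roughly equal cost.

    Hypothetical cost model (an assumption, not taken from the paper):
    token i costs linear_cost + attn_cost * (i + 1).
    """
    if not 1 <= num_segments <= seq_len:
        raise ValueError("need 1 <= num_segments <= seq_len")

    per_token = [linear_cost + attn_cost * (i + 1) for i in range(seq_len)]
    prefix = list(accumulate(per_token))  # prefix[i] = cost of tokens 0..i
    total = prefix[-1]

    bounds = [0]
    for j in range(1, num_segments):
        target = total * j / num_segments
        # Cut after the first token whose cumulative cost reaches the target.
        cut = bisect_left(prefix, target) + 1
        # Keep every segment non-empty, including the ones still to come.
        cut = max(cut, bounds[-1] + 1)
        cut = min(cut, seq_len - (num_segments - j))
        bounds.append(cut)
    bounds.append(seq_len)

    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]


if __name__ == "__main__":
    # Attention-dominated cost: earlier segments come out longer.
    print(balanced_split(16, 4, linear_cost=0.0, attn_cost=1.0))
    # Purely linear cost: the split degenerates to an even one.
    print(balanced_split(16, 4, linear_cost=1.0, attn_cost=0.0))
```

Under this assumed model, later tokens are more expensive, so balancing cost yields longer early segments and shorter late ones; when the attention term is zero, the result reduces to an even split.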