The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As the training sequence length of LLMs extends to 32k or even 128k tokens, current pipeline-parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, greatly hindering model scalability and training throughput. To enhance memory efficiency and training throughput, in this work we introduce an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences, named Seq1F1B. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing both bubble size and memory footprint. Since Seq1F1B may introduce slight extra bubbles when sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baselines such as Megatron's 1F1B pipeline parallelism, our method achieves higher training throughput with a smaller memory footprint. Notably, Seq1F1B efficiently trains an LLM with 30B parameters on sequences of up to 64k tokens using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM and is now available at: https://github.com/MayDomine/Seq1F1B.git.
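The computation-wise partitioning mentioned above can be illustrated with a minimal sketch. Under causal attention, a later sequence chunk attends to all earlier tokens, so equal-length chunks do unequal work; balancing cumulative compute instead yields chunks that shrink toward the end of the sequence. The quadratic cost model and the `balanced_split` helper below are illustrative assumptions, not the paper's exact formulation.

```python
def balanced_split(seq_len, num_chunks, attn_weight=1.0, mlp_weight=1.0):
    """Partition [0, seq_len) into num_chunks contiguous chunks with
    roughly equal per-chunk compute under a simple causal-attention cost
    model: token i costs mlp_weight (position-independent layers) plus
    attn_weight * i (attending to all i earlier tokens).

    Hypothetical cost model for illustration only.
    """
    # Cumulative cost of the first t tokens: mlp*t + attn * (0+1+...+t-1).
    cum = lambda t: mlp_weight * t + attn_weight * t * (t - 1) / 2
    total = cum(seq_len)

    bounds = [0]
    for j in range(1, num_chunks):
        target = total * j / num_chunks
        # Binary search for the smallest t with cum(t) >= target.
        lo, hi = bounds[-1], seq_len
        while lo < hi:
            mid = (lo + hi) // 2
            if cum(mid) >= target:
                hi = mid
            else:
                lo = mid + 1
        bounds.append(lo)
    bounds.append(seq_len)
    return [(bounds[i], bounds[i + 1]) for i in range(num_chunks)]


chunks = balanced_split(seq_len=1024, num_chunks=4)
# Earlier chunks are longer than later ones, because later tokens carry
# a larger attention cost; per-chunk compute is approximately equal.
```

Because attention cost grows with position, the resulting chunk lengths decrease monotonically, which is why an even split would leave later pipeline micro-steps heavier and reintroduce bubbles.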