The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLM training sequence lengths extend to 32k or even 128k, current pipeline-parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, which greatly hinder model scalability and training throughput. To improve memory efficiency and training throughput, we introduce Seq1F1B, an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing both bubble size and memory footprint. Because Seq1F1B may produce slight extra bubbles if sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baselines such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with a smaller memory footprint. Notably, Seq1F1B efficiently trains an LLM with 30B parameters on sequences up to 64k using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM and is now available at: https://github.com/MayDomine/Seq1F1B.git.
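The idea behind a computation-wise partition can be illustrated with a minimal sketch. It is not taken from the Seq1F1B code base: the function name `partition_by_compute`, the parameters `linear_cost` and `attn_cost`, and the cost model itself are assumptions made for illustration. The sketch assumes causal attention, so that the token at position i attends to i + 1 tokens and its cost grows linearly with its position, and it chooses segment boundaries so that each segment carries roughly the same estimated cost.

```python
from bisect import bisect_left
from itertools import accumulate


def partition_by_compute(seq_len, num_segments, linear_cost=1.0, attn_cost=0.001):
    """Split seq_len tokens into num_segments contiguous segments of
    roughly equal estimated compute.

    Assumed (hypothetical) cost model: the token at 0-based position i
    costs linear_cost + attn_cost * (i + 1), reflecting causal attention
    over its i + 1 visible tokens.
    """
    if not 1 <= num_segments <= seq_len:
        raise ValueError("need 1 <= num_segments <= seq_len")

    # prefix[t - 1] is the estimated cost of the first t tokens.
    costs = [linear_cost + attn_cost * (i + 1) for i in range(seq_len)]
    prefix = list(accumulate(costs))
    total = prefix[-1]

    boundaries = [0]
    for j in range(1, num_segments):
        target = total * j / num_segments
        # Smallest token count whose cumulative cost reaches the target.
        t = bisect_left(prefix, target) + 1
        # Keep every segment non-empty, including the ones still to come.
        t = max(t, boundaries[-1] + 1)
        t = min(t, seq_len - (num_segments - j))
        boundaries.append(t)
    boundaries.append(seq_len)

    return [end - start for start, end in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    print(partition_by_compute(4096, 4))
```

Under this assumed model, later segments come out shorter than earlier ones, because tokens near the end of the sequence attend to more context and are therefore more expensive. With `attn_cost=0` the split reduces to an even partition.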