Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLM training sequence lengths extend to 32k or even 128k tokens, current pipeline-parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, which greatly hinder model scalability and training throughput. To improve memory efficiency and training throughput, we introduce Seq1F1B, an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing both bubble size and memory footprint. Because Seq1F1B may produce slight extra bubbles if sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baselines such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with a smaller memory footprint. Notably, Seq1F1B efficiently trains an LLM with 30B parameters on sequences up to 64k tokens using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM and is now available at: https://github.com/MayDomine/Seq1F1B.git.
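To illustrate why an even split can leave segments unbalanced, the following is a minimal sketch of a computation-wise partition. It is not the authors' implementation. It assumes a simple hypothetical cost model for causal attention in which the token at position i costs `linear_cost + attn_cost * (i + 1)`, so later tokens are more expensive because they attend to more keys. The function name `balanced_split` and both cost parameters are illustrative assumptions, not names from Seq1F1B or Megatron-LM.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_split(seq_len, num_segments, linear_cost=1.0, attn_cost=0.001):
    """Split a sequence into segments of roughly equal estimated cost.

    Assumed (hypothetical) cost model: the token at position i costs
    linear_cost + attn_cost * (i + 1) under causal attention.
    Returns a list of segment lengths that sums to seq_len.
    """
    if not 1 <= num_segments <= seq_len:
        raise ValueError("need 1 <= num_segments <= seq_len")

    # prefix[i] is the estimated cost of tokens 0..i inclusive.
    prefix = list(
        accumulate(linear_cost + attn_cost * (i + 1) for i in range(seq_len))
    )
    total = prefix[-1]

    bounds = [0]
    for s in range(1, num_segments):
        target = total * s / num_segments
        # Cut right after the first token whose prefix cost reaches the target.
        cut = bisect_left(prefix, target) + 1
        cut = max(cut, bounds[-1] + 1)                # keep this segment non-empty
        cut = min(cut, seq_len - (num_segments - s))  # leave tokens for later segments
        bounds.append(cut)
    bounds.append(seq_len)

    return [end - start for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # With a non-zero attention term, earlier segments come out longer than later ones.
    print(balanced_split(8192, 4))
    # With attn_cost=0 every token costs the same, so the split is even.
    print(balanced_split(8, 4, linear_cost=1.0, attn_cost=0.0))
```

Under this assumed model, setting `attn_cost` to zero recovers the even split, while a positive value shortens later segments so that each schedulable unit carries a similar amount of work. The real ratio between the two cost terms depends on the model configuration and would need to be measured or derived from FLOP counts.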

