Increasing the context length of large language models (LLMs) unlocks fundamentally new capabilities, but also significantly increases the memory footprint of training. Previous model-parallel systems such as Megatron-LM partition and compute different attention heads in parallel. This results in large communication volumes and means they cannot scale beyond the number of attention heads, which hinders their adoption. In this paper, we introduce LightSeq, a new approach for long-context LLM training. LightSeq has several notable advantages. First, LightSeq partitions over the sequence dimension, so it is agnostic to model architecture and readily applicable to models with varying numbers of attention heads, such as Multi-Head, Multi-Query, and Grouped-Query attention. Second, LightSeq not only requires up to 4.7x less communication than Megatron-LM on popular LLMs but also overlaps communication with computation. To further reduce training time, LightSeq features a novel gradient checkpointing scheme that bypasses a forward computation for memory-efficient attention. We evaluate LightSeq on Llama-7B and its variants with sequence lengths from 32K to 512K. Through comprehensive experiments on single-node and cross-node training, we show that, compared to Megatron-LM, LightSeq achieves up to 1.24-2.01x end-to-end speedup and supports 2-8x longer sequences on models with fewer heads. Code will be available at https://github.com/RulinShao/LightSeq.
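To make the idea of partitioning over the sequence dimension concrete, the following is a minimal single-process sketch in NumPy. It is not LightSeq's implementation: the function names `full_causal_attention` and `sequence_parallel_causal_attention` are hypothetical, the "workers" are simulated with a loop rather than real devices, and no communication or overlap is modeled. It assumes causal attention and uses the standard running-max, running-sum softmax accumulation found in memory-efficient attention, so that each worker can process one key/value chunk at a time.

```python
import numpy as np


def full_causal_attention(q, k, v):
    """Reference: ordinary causal softmax attention over the whole sequence."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def sequence_parallel_causal_attention(q, k, v, num_workers):
    """Simulate sequence-dimension partitioning (illustrative sketch only).

    The sequence is split into `num_workers` contiguous chunks. Simulated
    worker i owns query chunk i and visits key/value chunks 0..i one at a
    time, merging partial results with a running max and running sum.
    """
    n, d = q.shape
    assert n % num_workers == 0, "sketch assumes an evenly divisible sequence"
    chunk = n // num_workers
    q_chunks = np.split(q, num_workers, axis=0)
    k_chunks = np.split(k, num_workers, axis=0)
    v_chunks = np.split(v, num_workers, axis=0)
    local_future = np.triu(np.ones((chunk, chunk), dtype=bool), k=1)

    outputs = []
    for i in range(num_workers):
        row_max = np.full((chunk, 1), -np.inf)  # running max of scores per row
        denom = np.zeros((chunk, 1))            # running softmax denominator
        acc = np.zeros((chunk, v.shape[1]))     # running unnormalized output
        for j in range(i + 1):                  # causal: skip chunks after i
            s = q_chunks[i] @ k_chunks[j].T / np.sqrt(d)
            if j == i:                          # diagonal block needs masking
                s = np.where(local_future, -np.inf, s)
            new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
            rescale = np.exp(row_max - new_max)  # correct earlier partial sums
            p = np.exp(s - new_max)
            denom = denom * rescale + p.sum(axis=-1, keepdims=True)
            acc = acc * rescale + p @ v_chunks[j]
            row_max = new_max
        outputs.append(acc / denom)
    return np.concatenate(outputs, axis=0)


# Small demonstration on random data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
q_demo, k_demo, v_demo = (rng.standard_normal((16, 8)) for _ in range(3))
reference = full_causal_attention(q_demo, k_demo, v_demo)
partitioned = sequence_parallel_causal_attention(q_demo, k_demo, v_demo, num_workers=4)
print("max abs difference:", np.abs(reference - partitioned).max())
```

Because the split is along the sequence axis, nothing in the sketch depends on how many attention heads the model has, which is the property the abstract contrasts with head-wise partitioning. In a real distributed setting, the inner loop over key/value chunks is where chunks would be exchanged between devices.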