Efficiently training LLMs with long sequences is important but challenging due to massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experimental results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.
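To make the 2D-Attention idea concrete, the following is a minimal illustrative sketch (not LoongTrain's actual API; all function and variable names are hypothetical) of how a 2D process grid might factor ranks into a head-parallel dimension, which splits attention heads Ulysses-style, and a context-parallel dimension, which splits the sequence into chunks Ring-Attention-style:

```python
# Hypothetical sketch of a 2D-Attention device grid. Names are illustrative,
# not taken from LoongTrain. Each rank is mapped to one head-parallel group
# (sharding attention heads) and one context-parallel ring position
# (sharding the sequence into contiguous chunks).

def build_2d_grid(world_size, head_parallel, context_parallel):
    """Factor a flat set of ranks into a (head-parallel, context-parallel) grid."""
    assert head_parallel * context_parallel == world_size
    grid = {}
    for rank in range(world_size):
        hp = rank // context_parallel   # head-parallel group index
        cp = rank % context_parallel    # position within the context-parallel ring
        grid[rank] = (hp, cp)
    return grid

def shard_for_rank(rank, grid, num_heads, seq_len,
                   head_parallel, context_parallel):
    """Return the (head range, token range) a given rank is responsible for."""
    hp, cp = grid[rank]
    heads_per_group = num_heads // head_parallel
    chunk = seq_len // context_parallel
    head_range = (hp * heads_per_group, (hp + 1) * heads_per_group)
    token_range = (cp * chunk, (cp + 1) * chunk)
    return head_range, token_range

# Example: 8 devices arranged as 2 head-parallel groups x 4-device rings.
grid = build_2d_grid(8, head_parallel=2, context_parallel=4)
heads, tokens = shard_for_rank(5, grid, num_heads=32, seq_len=8192,
                               head_parallel=2, context_parallel=4)
# rank 5 -> head group 1, ring position 1: heads [16, 32), tokens [2048, 4096)
```

Because the ring size is `context_parallel` rather than the full world size, the number of ring-attention communication steps no longer grows with the total device count, while the head-parallel dimension is bounded only by the model's head count instead of bounding the whole deployment.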