Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates two state-of-the-art SP approaches, DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topologies. It compares the communication and memory costs of SP with those of existing parallelism strategies, including data, tensor, ZeRO, and pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism that involves SP. Using SP, we achieve 47% MFU when training the LLAMA3-8B model with a sequence length of 208K on two 8xA800 nodes. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
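To make the idea of dividing the sequence dimension concrete, the following is a minimal sketch of a contiguous split of token positions across devices. It is an illustration only, not the implementation in the linked repository: the helper `shard_sequence` is hypothetical, "208K" is read as 208 × 1024 tokens, and all 16 GPUs are assumed to form a single SP group, which the abstract does not state.

```python
def shard_sequence(seq_len: int, num_devices: int) -> list[tuple[int, int]]:
    """Return one half-open [start, end) token range per device.

    Hypothetical helper for illustration: a plain contiguous split of the
    sequence dimension, with no load balancing for causal attention.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    if seq_len % num_devices != 0:
        raise ValueError("seq_len must be divisible by num_devices")
    chunk = seq_len // num_devices
    return [(rank * chunk, (rank + 1) * chunk) for rank in range(num_devices)]


# Assumption: "208K" means 208 * 1024 tokens.
SEQ_LEN = 208 * 1024
# Assumption: two nodes with 8 GPUs each, all in one SP group.
NUM_DEVICES = 2 * 8

shards = shard_sequence(SEQ_LEN, NUM_DEVICES)
tokens_per_device = shards[0][1] - shards[0][0]
print(f"tokens per device: {tokens_per_device}")
print(f"first shard: {shards[0]}, last shard: {shards[-1]}")
```

Under these assumptions each device holds 13,312 tokens of activations rather than the full sequence. In a hybrid setup that combines SP with other parallelism, the SP group would be smaller and each shard correspondingly larger.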