Sequence parallelism (SP), which partitions the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, namely DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topologies. It compares the communication and memory costs of SP with those of existing parallelism strategies, including data, tensor, ZeRO, expert, and pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism involving SP. We achieved 86\% model FLOPs utilization (MFU) on two 8xA800 nodes using SP with a sequence length of 208K for the LLAMA3-8B model. Our code is publicly available at \url{https://github.com/feifeibear/long-context-attention}.
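As a rough illustration of what dividing the sequence dimension across devices means, the following minimal sketch computes which contiguous token range each device would hold. It covers only the partitioning step, not the attention communication performed by DeepSpeed-Ulysses or Ring-Attention. The function name `shard_sequence` is hypothetical and is not taken from the released code, and reading "208K" as 208 * 1024 tokens is an assumption.

```python
def shard_sequence(seq_len: int, world_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) token ranges, one per device.

    Hypothetical helper for illustration only; it is not part of any
    sequence-parallel library.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    base, extra = divmod(seq_len, world_size)
    ranges = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token so the
        # split is as even as possible.
        length = base + (1 if rank < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


# Assumption: "208K" is read as 208 * 1024 tokens; two 8-GPU nodes
# give 16 devices in total.
shards = shard_sequence(208 * 1024, 16)
print(shards[0], shards[-1])
```

Under these assumptions each of the 16 devices holds 13,312 tokens, so per-device activation memory for the sequence dimension shrinks in proportion to the SP degree.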