Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences. Unfortunately, conventional transformers struggle with long-sequence training because of its overwhelming computation and memory requirements. Existing methods for long-sequence training offer limited speedup and memory reduction, and may compromise accuracy. This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer (LSS Transformer), for training transformers with long sequences. It distributes a long sequence into segments among GPUs, with each GPU computing a partial self-attention for its segment. It then uses fused communication and a novel double gradient averaging technique to avoid the need to aggregate partial self-attention and to minimize communication overhead. We compared the performance of the LSS Transformer with that of state-of-the-art Nvidia sequence parallelism on the Wikipedia enwik8 dataset. Results show that our proposed method is 5.6x faster and 10.2x more memory-efficient than state-of-the-art sequence parallelism on 144 Nvidia V100 GPUs. Moreover, our algorithm scales to an extreme sequence length of 50,112 at 3,456 GPUs, achieving 161% super-linear parallel efficiency and a throughput of 32 petaflops.
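The segment-wise idea described in the abstract can be illustrated with a toy, single-process sketch. This is not the LSS Transformer implementation: the abstract does not specify what each GPU's partial self-attention involves, so the sketch assumes that each segment attends only to its own tokens, uses the raw token vectors as queries, keys, and values (no learned projections), and omits the fused communication and double gradient averaging steps entirely. All function names are hypothetical.

```python
import math


def split_into_segments(sequence, num_workers):
    """Split a sequence into num_workers contiguous, equal-length segments."""
    if len(sequence) % num_workers != 0:
        raise ValueError("sequence length must be divisible by num_workers")
    seg_len = len(sequence) // num_workers
    return [sequence[i * seg_len:(i + 1) * seg_len] for i in range(num_workers)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segment_self_attention(segment):
    """Scaled dot-product self-attention restricted to one segment.

    Assumption: Q = K = V = the segment's token vectors.
    """
    dim = len(segment[0])
    outputs = []
    for query in segment:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in segment
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, segment))
             for j in range(dim)]
        )
    return outputs


def partial_attention_all_workers(sequence, num_workers):
    """Simulate each worker computing attention on its own segment only."""
    segments = split_into_segments(sequence, num_workers)
    return [segment_self_attention(seg) for seg in segments]


# Toy usage: 8 tokens of dimension 2, split across 4 simulated workers.
tokens = [[float(i), 1.0] for i in range(8)]
per_worker = partial_attention_all_workers(tokens, 4)
for rank, out in enumerate(per_worker):
    print(rank, out)
```

Under these assumptions, each simulated worker builds a score matrix over only its own tokens, so its size is the square of the segment length rather than the square of the full sequence length. That per-worker reduction is what makes splitting the sequence attractive for memory.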