Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexity of these modules poses a challenge when processing long sequences. One potential solution to the long-sequence problem is to use distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overhead to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named ``BurstAttention'' to optimize memory access and communication operations at both the global cluster level and the local device level. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long-sequence processing. The experimental results under various sequence-length settings demonstrate that BurstAttention offers significant advantages for processing long sequences over these competitive baselines, reducing communication overhead by 40% and achieving a 1.37x speedup when training on 128K-token sequences with 32 A100 GPUs.