Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing communication overheads by 40% and achieving a 2× speedup when training with a 32K sequence length on 8×A100 GPUs.