Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but their quadratic time and memory complexity poses a challenge when processing long sequences. One potential solution to the long-sequence problem is to use distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, a distributed approach inevitably introduces extra memory overhead to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long-sequence processing. The experimental results under different sequence-length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these baselines, reducing communication overhead by 40% and achieving a 1.37× speedup when training with a 128K sequence length on 32 A100 GPUs.
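To make the phrase "aggregate local results into global ones" concrete, the following is a minimal sketch of how partial attention results computed over separate key/value shards can be merged into the exact global softmax attention output. It is a generic illustration of the well-known rescaling (log-sum-exp) trick, not the BurstAttention algorithm itself; the function names (`local_attention`, `merge`, `sharded_attention`) are hypothetical, the "devices" are simulated by list slices in one process, and no real communication takes place.

```python
import math

def scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def full_attention(q, keys, values):
    """Reference: softmax attention for one query over the whole sequence."""
    s = scores(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / total for j in range(dim)]

def local_attention(q, keys, values):
    """Partial result for one shard: (max score, sum of exps, unnormalized output)."""
    s = scores(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    dim = len(values[0])
    out = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(dim)]
    return m, sum(w), out

def merge(partials):
    """Aggregate per-shard partial results into the global attention output."""
    m, l, o = partials[0]
    for m_i, l_i, o_i in partials[1:]:
        m_new = max(m, m_i)
        # Rescale both running and incoming statistics to the shared maximum.
        a, b = math.exp(m - m_new), math.exp(m_i - m_new)
        l = a * l + b * l_i
        o = [a * x + b * y for x, y in zip(o, o_i)]
        m = m_new
    return [x / l for x in o]

def sharded_attention(q, keys, values, num_shards):
    """Split keys/values into contiguous shards (stand-ins for devices) and merge."""
    size = math.ceil(len(keys) / num_shards)
    partials = [
        local_attention(q, keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    return merge(partials)
```

In this sketch each shard only needs to pass along a maximum, a normalizer, and an unnormalized output vector per query, which hints at why the size and scheduling of such exchanges matter for the communication overhead discussed above.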