Transformers have become the architecture of choice for many state-of-the-art AI models, delivering exceptional performance across a wide range of applications. However, their memory demands limit the sequence lengths they can handle, which poses challenges for tasks involving long sequences or long-term dependencies. We present Ring Attention, an approach that uses blockwise computation of self-attention to distribute long sequences across multiple devices while overlapping the communication of key-value blocks with the computation of blockwise attention. Because each device holds only its own block of the sequence, Ring Attention enables training and inference on sequences up to device-count times longer than those supported by prior memory-efficient Transformers, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in supporting large input sequence lengths and improving performance.
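To make the blockwise, ring-structured computation described in the abstract concrete, the following is a minimal single-process sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `full_attention` and `ring_attention` are hypothetical, devices are simulated as list indices, attention is non-causal, and the key-value "send to the next device" step is modeled as a list rotation. The overlap of communication with computation is not modeled here; only the arithmetic is. Each simulated device keeps its query block fixed and accumulates a numerically stable running softmax as key-value blocks pass around the ring.

```python
import math


def full_attention(q, k, v):
    """Reference softmax attention over whole sequences (lists of vectors)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out


def ring_attention(q, k, v, num_devices):
    """Simulated ring attention (hypothetical helper, non-causal).

    The sequence is split into num_devices blocks. Each simulated device
    holds one query block permanently and one key-value block at a time.
    """
    n = len(q)
    assert n % num_devices == 0, "sequence length must divide evenly"
    b = n // num_devices
    dim_v = len(v[0])
    scale = 1.0 / math.sqrt(len(q[0]))

    # Shard queries and key-value pairs: one block per simulated device.
    q_blocks = [q[i * b:(i + 1) * b] for i in range(num_devices)]
    kv_blocks = [(k[i * b:(i + 1) * b], v[i * b:(i + 1) * b])
                 for i in range(num_devices)]

    # Per-device, per-query-row running softmax statistics.
    run_max = [[-math.inf] * b for _ in range(num_devices)]
    run_sum = [[0.0] * b for _ in range(num_devices)]
    run_out = [[[0.0] * dim_v for _ in range(b)] for _ in range(num_devices)]

    for _ in range(num_devices):
        for dev in range(num_devices):
            k_blk, v_blk = kv_blocks[dev]
            for r, qi in enumerate(q_blocks[dev]):
                scores = [scale * sum(a * c for a, c in zip(qi, kj))
                          for kj in k_blk]
                new_max = max(run_max[dev][r], max(scores))
                # Rescale earlier partial results to the new running maximum.
                corr = math.exp(run_max[dev][r] - new_max)
                w = [math.exp(s - new_max) for s in scores]
                run_sum[dev][r] = run_sum[dev][r] * corr + sum(w)
                run_out[dev][r] = [
                    o * corr + sum(wj * vj[d] for wj, vj in zip(w, v_blk))
                    for d, o in enumerate(run_out[dev][r])
                ]
                run_max[dev][r] = new_max
        # Ring step: device i passes its key-value block to device i + 1.
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    # Normalize and concatenate the per-device outputs in sequence order.
    return [[o / run_sum[dev][r] for o in run_out[dev][r]]
            for dev in range(num_devices) for r in range(b)]
```

In this sketch, each simulated device stores only one query block and one key-value block of length `n / num_devices` at any time, which is the property that lets the supported sequence length grow with the number of devices. Calling `ring_attention(q, k, v, num_devices)` should agree with `full_attention(q, k, v)` up to floating-point error for any device count that divides the sequence length.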