Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showing exceptional performance across a wide range of applications. However, their memory demands limit the sequence lengths they can handle, which creates challenges for tasks involving extended sequences or long-term dependencies. We present Ring Attention, an approach that uses blockwise computation of self-attention to distribute long sequences across multiple devices while overlapping the communication of key-value blocks with the computation of blockwise attention. Ring Attention enables training and inference on sequences up to N times longer than those handled by prior memory-efficient Transformers, where N is the number of devices, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in supporting much larger input sequence lengths and improving performance.
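As a rough illustration of the mechanism described above, the following single-process Python sketch simulates the ring on one machine. It is not the authors' implementation, and it rests on several simplifying assumptions: single-head attention, no causal mask, and no real inter-device communication, so the overlap of communication with computation is not modeled. Rotating a Python list stands in for each device passing its key-value block to its neighbor. The names `ring_attention`, `full_attention`, `dot`, and `num_devices` are hypothetical and chosen only for this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * dot(qi, kj) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([dot(w, col) / z for col in zip(*v)])
    return out


def ring_attention(q, k, v, num_devices):
    """Simulated ring: each 'device' keeps one query block and sees one
    key-value block per step; blocks rotate around the ring between steps."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    assert n % num_devices == 0, "sketch assumes equal-sized blocks"
    b = n // num_devices
    scale = 1.0 / math.sqrt(d)

    # Shard the sequence: device i holds query block i and key-value block i.
    q_blocks = [q[i * b:(i + 1) * b] for i in range(num_devices)]
    kv_blocks = [(k[i * b:(i + 1) * b], v[i * b:(i + 1) * b])
                 for i in range(num_devices)]

    # Running softmax statistics per device and per query row.
    run_max = [[-math.inf] * b for _ in range(num_devices)]
    denom = [[0.0] * b for _ in range(num_devices)]
    numer = [[[0.0] * dv for _ in range(b)] for _ in range(num_devices)]

    for _ in range(num_devices):
        for dev in range(num_devices):
            k_blk, v_blk = kv_blocks[dev]
            for r, qi in enumerate(q_blocks[dev]):
                scores = [scale * dot(qi, kj) for kj in k_blk]
                new_max = max(run_max[dev][r], max(scores))
                # Rescale previously accumulated sums to the new maximum.
                corr = math.exp(run_max[dev][r] - new_max)
                w = [math.exp(s - new_max) for s in scores]
                denom[dev][r] = corr * denom[dev][r] + sum(w)
                numer[dev][r] = [corr * old + dot(w, col)
                                 for old, col in zip(numer[dev][r], zip(*v_blk))]
                run_max[dev][r] = new_max
        # Ring step: device dev receives the block held by device dev - 1.
        kv_blocks = [kv_blocks[(dev - 1) % num_devices]
                     for dev in range(num_devices)]

    return [[x / denom[dev][r] for x in numer[dev][r]]
            for dev in range(num_devices) for r in range(b)]


if __name__ == "__main__":
    random.seed(0)
    q, k, v = ([[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
               for _ in range(3))
    ref = full_attention(q, k, v)
    out = ring_attention(q, k, v, num_devices=4)
    err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
    print("max abs difference vs. full attention:", err)
```

In this sketch each simulated device only ever holds one query block and one key-value block at a time, so per-device memory scales with the block size rather than the full sequence length. This is the property that lets the supported sequence length grow with the number of devices.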