Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands of Transformers limit their ability to handle long sequences, making it challenging to use videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference on sequences up to device-count times longer than those achievable with prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in enabling context sizes of millions of tokens and improving performance.
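The ring schedule described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the function names `full_attention` and `ring_attention`, the pure-Python list representation, and the streaming-softmax bookkeeping (`run_max`, `run_sum`, `run_out`) are assumptions made for illustration. The sketch covers only non-causal self-attention without the usual scaling factor, omits the blockwise feedforward, and runs the "devices" sequentially, so it does not show the overlap of communication with computation. What it does show is that rotating key-value blocks around a ring and accumulating blockwise results reproduces exact attention, with no approximation.

```python
import math
import random


def full_attention(q, k, v):
    """Reference softmax attention over the whole sequence (non-causal)."""
    dim = len(v[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(dim)])
    return out


def ring_attention(q, k, v, num_devices):
    """Simulate the ring in one process: each 'device' owns one query block
    and hands its current key-value block to its neighbour after every step."""
    n = len(q)
    assert n % num_devices == 0, "sequence must split evenly across devices"
    size = n // num_devices
    dim = len(v[0])
    blocks = [slice(i * size, (i + 1) * size) for i in range(num_devices)]
    q_blocks = [q[b] for b in blocks]
    kv_blocks = [(k[b], v[b]) for b in blocks]

    # Per-query running statistics for a numerically stable streaming softmax.
    run_max = [[-math.inf] * size for _ in range(num_devices)]
    run_sum = [[0.0] * size for _ in range(num_devices)]
    run_out = [[[0.0] * dim for _ in range(size)] for _ in range(num_devices)]

    for _ in range(num_devices):
        for dev in range(num_devices):
            k_blk, v_blk = kv_blocks[dev]
            for i, qi in enumerate(q_blocks[dev]):
                scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
                new_max = max(run_max[dev][i], max(scores))
                # Rescale earlier partial results to the new running maximum.
                scale = math.exp(run_max[dev][i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[dev][i] = run_sum[dev][i] * scale + sum(weights)
                run_out[dev][i] = [
                    run_out[dev][i][d] * scale
                    + sum(w * vj[d] for w, vj in zip(weights, v_blk))
                    for d in range(dim)
                ]
                run_max[dev][i] = new_max
        # Rotate: device d receives the block previously held by device d - 1.
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    # Normalise and concatenate the per-device outputs in sequence order.
    return [[x / run_sum[dev][i] for x in run_out[dev][i]]
            for dev in range(num_devices) for i in range(size)]


# Small demonstration on random data (sizes chosen arbitrarily).
random.seed(0)
seq_len, width = 8, 4
q = [[random.gauss(0, 1) for _ in range(width)] for _ in range(seq_len)]
k = [[random.gauss(0, 1) for _ in range(width)] for _ in range(seq_len)]
v = [[random.gauss(0, 1) for _ in range(width)] for _ in range(seq_len)]
reference = full_attention(q, k, v)
ringed = ring_attention(q, k, v, num_devices=4)
max_error = max(abs(a - b)
                for row_a, row_b in zip(ringed, reference)
                for a, b in zip(row_a, row_b))
print("largest difference from full attention:", max_error)
```

In this sketch each simulated device only ever holds one query block and one key-value block at a time, which is the property that lets the maximum sequence length grow with the number of devices. In a real multi-device setting the rotation step would be a collective send and receive issued while the current block's attention is still being computed.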