Generating long sequences of tokens given a long-context input imposes a heavy computational burden on large language models (LLMs). One of the computational bottlenecks is computing attention over the long input sequence at each generation step. In this paper, we propose Recycled Attention, an inference-time method which alternates between full-context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention pattern of a previous token that performed full attention and attend only to the top-K most-attended tokens, reducing the cost of data movement and attention computation. Compared to previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high accumulated attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step. We evaluate our method on RULER, a suite of tasks designed to comprehensively evaluate long-context abilities, and on long-context language modeling tasks. Applied to off-the-shelf LLMs, our method achieves speedups comparable to baselines that consider only the local context while improving performance by 2x. We further explore two ideas to improve the performance-efficiency trade-off: (1) dynamically deciding when to perform a recycled or full attention step based on query similarities, and (2) continued pre-training of the model with Recycled Attention.
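The core mechanism can be illustrated with a minimal single-head sketch: a full-attention step records which K context positions receive the most attention, and a later recycled step attends only to that subset of the KV cache. All function names here are hypothetical, and a real implementation would operate on batched, multi-head, multi-layer KV caches.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topk_indices(attn_weights, k):
    # Indices of the k most-attended context positions.
    return np.argsort(attn_weights)[-k:]

def recycled_attention_step(q, K_cache, V_cache, idx):
    # Attend only to the recycled subset of the KV cache,
    # avoiding reading the full-length K and V.
    K_sub, V_sub = K_cache[idx], V_cache[idx]
    w = softmax(K_sub @ q / np.sqrt(q.shape[0]))
    return w @ V_sub

# Toy example with a 100-token context, head dim 8, K = 10.
rng = np.random.default_rng(0)
d, n, k = 8, 100, 10
K_cache = rng.standard_normal((n, d))
V_cache = rng.standard_normal((n, d))

# Full-attention step: record the most-attended positions.
q_full = rng.standard_normal(d)
full_w = softmax(K_cache @ q_full / np.sqrt(d))
idx = topk_indices(full_w, k)

# Subsequent recycled step: a new query reuses those positions.
q_new = rng.standard_normal(d)
out = recycled_attention_step(q_new, K_cache, V_cache, idx)
```

The sketch assumes nearby decoding steps attend to similar positions, which is why the top-K set chosen at one step can stand in for full attention at the next few steps.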