Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial key-value (KV) cache memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and a constant memory footprint during generation. Experimental results show that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
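The core idea above — a differentiable top-k mask that keeps roughly k KV pairs per query — can be illustrated with a minimal sketch. This is not the authors' implementation: the mask form `m_i = clip(s_i - tau, 0, 1)` with `sum(m) = k` (a sparsemax-style projection) and the bisection search for the threshold `tau` are simplifying assumptions chosen for clarity; `sparsek_mask` and its arguments are hypothetical names.

```python
import numpy as np

def sparsek_mask(scores, k):
    """Soft top-k mask: m_i = clip(s_i - tau, 0, 1) with sum(m) ~= k.

    The threshold tau is found by bisection (the sum is monotonically
    decreasing in tau). Entries well above tau get mask 1, entries well
    below get 0, and at most a few boundary entries take fractional
    values, which is what keeps the operator differentiable almost
    everywhere. A simplified sketch, not the paper's exact algorithm.
    """
    lo = scores.min() - 1.0   # here every mask entry is 1, sum = n >= k
    hi = scores.max()         # here every mask entry is 0, sum = 0 <= k
    for _ in range(50):       # bisection on the threshold tau
        tau = 0.5 * (lo + hi)
        total = np.clip(scores - tau, 0.0, 1.0).sum()
        if total > k:
            lo = tau          # tau too small: mask keeps too much mass
        else:
            hi = tau          # tau too large (or exact): shrink from above
    return np.clip(scores - tau, 0.0, 1.0)

# Hypothetical usage: a scoring network would produce one score per KV
# pair; the mask then zeroes out all but ~k of them for this query.
scores = np.array([0.1, 2.0, 0.3, 1.5, -0.5])
mask = sparsek_mask(scores, k=2)
```

In a full attention layer, the mask would multiply (or gate) the attention weights so that each query attends to a constant number of KV pairs, which is what yields the linear time and constant generation-memory claims in the abstract.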