The quadratic computational complexity of the attention mechanism in current Large Language Models (LLMs) renders inference with long contexts prohibitively expensive. To address this challenge, various approaches aim to retain critical portions of the context to optimally approximate Full Attention (FA) through Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of virtually unlimited text lengths in a streaming manner. However, these methods struggle to achieve performance comparable to FA, particularly on retrieval tasks. In this paper, our analysis of attention head patterns reveals that LLMs' attention distributions show strong local correlations, naturally reflecting a chunking mechanism for the input context. We propose the Ltri-LLM framework, which divides KVs into spans, stores them in an offline index, and retrieves the relevant KVs into memory for different queries. Experimental results on popular long-text benchmarks show that Ltri-LLM achieves performance close to FA while maintaining efficient, streaming-based inference.
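The span-index-retrieve idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the fixed `span_len`, the mean-key span representative, and dot-product scoring are assumptions for the sketch, not the paper's actual span-boundary or retrieval design.

```python
import numpy as np

def build_span_index(keys, span_len=128):
    """Split cached key vectors into fixed-length spans and keep one
    representative vector (the mean key) per span as an offline index.
    Note: Ltri-LLM derives span boundaries from attention locality;
    fixed-length spans are a simplifying assumption here."""
    spans = [keys[i:i + span_len] for i in range(0, len(keys), span_len)]
    reps = np.stack([s.mean(axis=0) for s in spans])
    return spans, reps

def retrieve_spans(query, reps, top_k=2):
    """Score each span representative against the query vector and
    return the indices of the top_k highest-scoring spans, whose KVs
    would then be loaded into memory for attention."""
    scores = reps @ query                      # one score per span
    return np.argsort(scores)[::-1][:top_k]    # descending order

# Toy example: 512 cached key vectors of dimension 64.
rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 64))
spans, reps = build_span_index(keys, span_len=128)  # 4 spans of 128 keys
query = rng.standard_normal(64)
selected = retrieve_spans(query, reps, top_k=2)     # indices of 2 spans
```

Only the selected spans' KVs participate in attention for a given query, which is what keeps per-step cost bounded as the context grows.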