As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or even 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging because inference speed degrades significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens dominates the attention outcome. However, we observe that the criticality of a token depends strongly on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using the Query vector. By loading only the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest achieves up to 7.03x self-attention speedup, which reduces inference latency by 2.23x while performing well on tasks with long dependencies and incurring negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .
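The page-level criticality estimate described above can be sketched as follows: for each KV cache page, keep the per-channel minimum and maximum of its Key vectors, then bound the attention score of any key in the page by choosing, per channel, whichever extreme maximizes the product with the Query. The minimal NumPy sketch below is illustrative only; function names and the page layout are assumptions, not the paper's implementation.

```python
import numpy as np

def page_criticality(q, k_min, k_max):
    """Upper bound on q . k over every key k in a page.

    q:     (d,) query vector
    k_min: (d,) per-channel minimum of the page's Key vectors
    k_max: (d,) per-channel maximum of the page's Key vectors

    Per channel, q_i * k_i is maximized at k_min[i] if q_i < 0
    and at k_max[i] if q_i >= 0, so taking the elementwise max of
    the two products and summing yields a valid upper bound.
    """
    return np.maximum(q * k_min, q * k_max).sum(axis=-1)

def select_topk_pages(q, page_mins, page_maxs, top_k):
    """Return indices of the top_k pages with the highest criticality bound."""
    scores = np.array(
        [page_criticality(q, mn, mx) for mn, mx in zip(page_mins, page_maxs)]
    )
    return np.argsort(scores)[::-1][:top_k]

# Hypothetical usage: 4 pages of 16 keys each, head dimension 8.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 16, 8))
q = rng.normal(size=8)
mins, maxs = keys.min(axis=1), keys.max(axis=1)
chosen = select_topk_pages(q, mins, maxs, top_k=2)
```

Self-attention is then computed only over the keys and values in the selected pages; the bound guarantees no page containing the true maximum-score key is ranked below its actual score, which is why accuracy loss stays small.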