With the development of large language models (LLMs), the ability to handle longer contexts has become a key capability for Web applications such as cross-document understanding and LLM-powered search systems. However, this progress faces two major challenges: performance degradation caused by out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues hinder the application of LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a model-agnostic, training-free method for efficient and accurate long-context inference. TokenSelect builds on the observation of non-contiguous attention sparsity, using Query-Key dot products to measure per-head KV Cache criticality at the token level. Through a per-head soft voting mechanism, TokenSelect selectively involves only a small number of critical KV Cache tokens in the attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design a Selection Cache based on the observation of consecutive-query similarity and implement an efficient dot-product kernel, significantly reducing the overhead of token selection. A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention computation and up to 2.28x acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
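The per-head soft voting described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the per-head softmax normalization before summing votes, and all shapes are assumptions for illustration; the actual TokenSelect kernel operates on real KV Cache layouts and fused GPU kernels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_critical_tokens(q, k_cache, top_k):
    """Token-level KV selection via per-head soft voting (illustrative).

    q:       (num_heads, head_dim)           current query, one vector per head
    k_cache: (num_heads, seq_len, head_dim)  cached keys
    Returns the indices of the top_k most critical cached tokens.
    """
    # Per-head criticality: dot product of the query with every cached key.
    scores = np.einsum("hd,hsd->hs", q, k_cache)   # (num_heads, seq_len)
    # Soft vote: normalize each head's scores before summing, so heads with
    # large score magnitudes do not dominate the vote (an assumption here).
    votes = softmax(scores, axis=-1).sum(axis=0)   # (seq_len,)
    # Only the selected tokens would participate in attention.
    return np.argsort(votes)[-top_k:]

rng = np.random.default_rng(0)
num_heads, seq_len, head_dim, top_k = 4, 128, 16, 8
q = rng.standard_normal((num_heads, head_dim))
k_cache = rng.standard_normal((num_heads, seq_len, head_dim))
idx = select_critical_tokens(q, k_cache, top_k)
print(idx.shape)
```

Attention is then computed only over the selected indices, reducing the cost from the full sequence length to `top_k` tokens per step.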