Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation caused by out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit the use of LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds on the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV Cache criticality at the token level. Through a per-head soft voting mechanism, TokenSelect selectively involves a small number of critical KV Cache tokens in the attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive query similarity and implement an efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of TokenSelect demonstrates up to a $23.84\times$ speedup in attention computation and up to a $2.28\times$ acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
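To make the selection mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of token-level KV selection: each cached token is scored per head with a QK dot product, the per-head scores are turned into votes via a softmax, the votes are summed across heads, and only the top-k tokens are kept for attention. The tensor shapes, function name, and `top_k` budget are illustrative assumptions.

```python
import torch

def select_critical_tokens(q, k_cache, top_k):
    """
    q:       (num_heads, head_dim)          current query, one vector per head
    k_cache: (num_heads, seq_len, head_dim) cached keys
    returns: (top_k,) indices of selected KV Cache tokens
    """
    # Per-head criticality: dot product between the query and every cached key.
    scores = torch.einsum("hd,hnd->hn", q, k_cache)    # (num_heads, seq_len)
    # Per-head soft voting: normalize each head's scores before aggregating,
    # so no single head with large-magnitude logits dominates the selection.
    votes = torch.softmax(scores, dim=-1).sum(dim=0)   # (seq_len,)
    # Keep only the most critical tokens for the attention computation.
    return torch.topk(votes, k=min(top_k, votes.numel())).indices

# Illustrative usage: 8 heads, head_dim 64, 4096 cached tokens, select 256.
q = torch.randn(8, 64)
k_cache = torch.randn(8, 4096, 64)
selected = select_critical_tokens(q, k_cache, top_k=256)
```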