The rapid advancement of Large Language Models (LLMs) has driven growing demand for processing extended context sequences in contemporary applications. However, this progress faces two major challenges: performance degradation due to out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues hinder the application of LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using Query-Key dot products to measure per-head KV cache criticality at the token level. Through a per-head soft voting mechanism, TokenSelect involves only a small number of critical KV cache tokens in the attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive-query similarity and implement an efficient dot-product kernel, significantly reducing the selection overhead. Comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention computation and up to 2.28x acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
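The per-head criticality measurement and soft voting described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the shapes, the softmax-based normalization of each head's votes, and the fixed top-k budget are all illustrative assumptions.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
np.random.seed(0)
num_heads, seq_len, head_dim, k = 4, 128, 16, 8

q = np.random.randn(num_heads, head_dim)              # current query, one vector per head
keys = np.random.randn(num_heads, seq_len, head_dim)  # cached keys for each head

# Per-head criticality: dot product of the query with every cached key.
scores = np.einsum('hd,hsd->hs', q, keys)             # shape (num_heads, seq_len)

# Soft voting: normalize each head's scores (softmax here, as one plausible
# choice) so that no single head with large-magnitude logits dominates,
# then sum the normalized votes across heads.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
votes = exp / exp.sum(axis=1, keepdims=True)
token_score = votes.sum(axis=0)                       # shape (seq_len,)

# Keep only the top-k most critical KV cache tokens for the attention step.
selected = np.argsort(token_score)[-k:]
print(sorted(selected.tolist()))
```

Note that the selected indices are generally non-contiguous, which is exactly the sparsity pattern that motivates token-level (rather than block-level) selection.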