We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity to attention scores. These results imply that KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families, observing a performance gap of less than 0.04% from the non-evicting baseline on LongBench with an 8K cache budget ($\sim$23% KV cache reduction) for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark, and reduce end-to-end inference latency by up to 30% compared to other token-eviction methods.
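The abstract describes eviction driven purely by key geometry rather than attention scores. The exact scoring rule is not given here, so the following is a minimal hedged sketch: it assumes distinctiveness is measured as (negative) cosine similarity of each key to the mean key direction, and keeps the `budget` least-similar keys. The function name `keydiff_evict` and this particular score are illustrative assumptions, not the paper's definitive algorithm.

```python
import numpy as np

def keydiff_evict(keys: np.ndarray, budget: int) -> np.ndarray:
    """Keep the `budget` most geometrically distinctive keys.

    Distinctiveness is scored as negative cosine similarity to the
    mean key direction -- an assumed proxy for the paper's key-diversity
    criterion. Note: no attention scores are used, so this is compatible
    with fused kernels like FlashAttention.
    """
    # Normalize keys so the dot product below is cosine similarity.
    norms = np.linalg.norm(keys, axis=-1, keepdims=True)
    unit = keys / np.maximum(norms, 1e-12)

    # Mean key direction acts as the "typical" key.
    anchor = unit.mean(axis=0)
    anchor /= max(np.linalg.norm(anchor), 1e-12)

    # Lower similarity to the mean direction = more distinctive.
    scores = unit @ anchor
    keep = np.argsort(scores)[:budget]
    return np.sort(keep)  # preserve original token order

# Toy usage: 6 cached keys of dimension 8, keep 3 under the budget.
rng = np.random.default_rng(0)
keys = rng.standard_normal((6, 8))
kept = keydiff_evict(keys, budget=3)
```

Because the score depends only on the keys already in the cache, the same rule can be applied chunk by chunk during prefill, which is what allows arbitrarily long prompts to be processed under a fixed memory budget.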