Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV entries that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future": a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is then used to estimate the importance of cached KV entries more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future responses without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design incurs negligible runtime overhead, comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
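To make the underlying mechanism concrete, the following minimal sketch illustrates generic importance-score-based KV eviction (the common setting described above, not the LookaheadKV method itself): given per-token importance scores from any estimator, only the top-budget cached entries are retained. All names here (`evict_kv`, the toy shapes) are illustrative assumptions.

```python
import numpy as np

def evict_kv(keys, values, importance, budget):
    """Keep only the `budget` most important cached KV entries.

    keys, values: (seq_len, head_dim) arrays for one attention head
    importance:   (seq_len,) estimated importance score per cached token
    budget:       number of entries to retain after eviction
    """
    # Indices of the top-`budget` scores, re-sorted to preserve
    # the original token order within the cache.
    keep = np.sort(np.argsort(importance)[-budget:])
    return keys[keep], values[keep]

# Toy example: 6 cached tokens, retain the 3 highest-scoring ones.
rng = np.random.default_rng(0)
keys = rng.standard_normal((6, 4))
values = rng.standard_normal((6, 4))
importance = np.array([0.9, 0.1, 0.5, 0.05, 0.8, 0.2])

k, v = evict_kv(keys, values, importance, budget=3)
print(k.shape)  # (3, 4) -- tokens 0, 2, and 4 survive
```

The methods discussed above differ only in how `importance` is produced: cheap attention-based heuristics, an expensive surrogate response from a draft generator, or, in LookaheadKV, a learned parameter-efficient predictor.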