The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble precomputed KV caches of the documents retrieved for a user query and recompute a selected subset of tokens to restore cross-attention among these caches. However, we identify a fundamental "crowding-out effect" in current token selection criteria: globally salient but query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens by their semantic relevance to the user query and employs a dual-stage recomputation pipeline that fuses layer-wise attention metrics into a high-utility token set. By dedicating the recomputation budget to bridging the informational gap between the retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Extensive evaluation shows that ProphetKV retains 96%-101% of full-prefill accuracy at only a 20% recomputation ratio, while improving accuracy by 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).