Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Modern retrieval-augmented generation(RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token(TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems(RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when 4 cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR), fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real-LLM generations(Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime(vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift(vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.

翻译：现代检索增强生成（RAG）部署日益依赖缓存以降低 token 开销和首 token 延迟（TTFT）。在 vLLM 等服务栈中，前缀级 KV 复用已成为标准配置，而近期系统（RAGCache、TurboRAG、CacheBlend、EPIC、ContextPilot、PCR、LMCache）进一步推动了分块级和位置无关的复用技术。相比之下，输出级语义答案缓存仍较为脆弱：相似的提示可能映射到不同的正确答案，检索证据会随语料库更新而发生漂移，且对抗性碰撞攻击已被证明能够劫持缓存的响应。我们认为，缓存答案复用的正确框架不在于如何更快地复用，而在于何时复用是安全的。为此，我们提出 GroundedCache——一种基于证据验证的缓存路由器，仅在四个低成本门控条件同时成立时才允许使用缓存答案：查询相似性、检索证据重叠度、来源版本有效性，以及新检索证据对缓存答案的词汇（或基于评判器）支持。我们构建了一个六场景的工作负载，用于重点测试缓存安全性（而非仅命中率），并引入面向运营商的指标——不安全服务率（USR），即接收到错误缓存答案的查询比例。在 2 个数据集和 12,000 次真实大语言模型生成（基于自动前缀缓存的 vLLM 上的 Qwen2.5-7B-Instruct）中，GroundedCache 在 HotpotQA 的所有场景下将 USR 降至 0.0%（朴素缓存下为 15-35%），在 mtRAG 文档漂移场景下降至 1.5%（朴素缓存下为 51.5%），在特定对抗场景下实现了 34 倍的降低，在 mtRAG 的其他场景下实现了 3-10 倍的降低，同时端到端 p50 延迟保持在无缓存 RAG 基线的 1.04-1.07 倍以内。逐门控消融实验表明，词汇支持门控在两个数据集上均为关键的安全机制，其余门控则以近乎零成本提供纵深防御。我们已开源实现代码、工作负载及评估框架。