Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from degraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy at the cost of low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, applying efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints preserve reasoning fidelity. It integrates seamlessly with existing LLM inference engines, improving their prefill performance by 1.5-3x over state-of-the-art methods while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads. Our code is released at: https://github.com/Edinburgh-AgenticAI/RAGBoost.