Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval outcomes through iterative LLM refinement or self-critique capabilities acquired through additional instruction tuning of LLMs. In this work, we introduce Speculative RAG, a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM. Each draft is generated from a distinct subset of retrieved documents, offering diverse perspectives on the evidence while reducing the input token count per draft. This approach enhances comprehension of each subset and mitigates potential position bias over long contexts. Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. Extensive experiments demonstrate that Speculative RAG achieves state-of-the-art performance with reduced latency on the TriviaQA, MuSiQue, PubHealth, and ARC-Challenge benchmarks. It notably enhances accuracy by up to 12.97% while reducing latency by 51% compared to conventional RAG systems on PubHealth.
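The draft-then-verify flow described above can be sketched as follows. This is a minimal illustration of the control flow only, not the paper's implementation: `draft_with_specialist` and `score_with_generalist` are hypothetical stand-ins for calls to the small drafting LM and the large verifying LM, and the round-robin partition is a simple placeholder for the paper's document-subset sampling.

```python
def partition_documents(docs, num_subsets):
    """Split retrieved documents into distinct subsets, one per draft.

    Round-robin assignment is a placeholder; the actual subset-sampling
    strategy is a design choice of the full system.
    """
    subsets = [[] for _ in range(num_subsets)]
    for i, doc in enumerate(docs):
        subsets[i % num_subsets].append(doc)
    return subsets


def draft_with_specialist(question, subset):
    # Hypothetical stand-in: a real system would prompt the smaller,
    # distilled specialist LM with the question and this document subset,
    # keeping the per-draft input token count low.
    return f"draft answering {question!r} from {len(subset)} docs"


def score_with_generalist(question, draft):
    # Hypothetical stand-in: a real system would have the larger
    # generalist LM assign a verification score to each draft in a
    # single pass, without generating new text itself.
    return len(draft)  # dummy score for illustration only


def speculative_rag(question, docs, num_drafts=3):
    """Draft answers in parallel over document subsets, then verify once."""
    subsets = partition_documents(docs, num_drafts)
    # Drafting is independent per subset, so these calls can run in parallel.
    drafts = [draft_with_specialist(question, s) for s in subsets]
    scores = [score_with_generalist(question, d) for d in drafts]
    best = max(range(len(drafts)), key=lambda i: scores[i])
    return drafts[best]
```

The design point this sketch captures is the division of labor: the cheap specialist does all generation over small, focused contexts, while the expensive generalist is invoked only to rank the finished drafts.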