Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval. Our method preloads all relevant resources into the LLM's extended context and caches its runtime parameters; this is especially practical when the documents or knowledge intended for retrieval are of a limited, manageable size. During inference, the model uses these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval latency and minimizes retrieval errors while maintaining context relevance. Performance evaluations across multiple benchmarks highlight scenarios where long-context LLMs either outperform or complement traditional RAG pipelines. These findings suggest that, for certain applications, particularly those with a constrained knowledge base, CAG provides a streamlined and efficient alternative to RAG, achieving comparable or superior results with reduced complexity.
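The workflow described above, preload once, then answer every query from the cached context, can be sketched as follows. This is a minimal illustration of the control flow only: `CachedLLM`, `preload`, and `answer` are hypothetical names, the "cache" is a plain string standing in for an LLM's precomputed KV state, and the lookup is a toy substring match rather than real generation.

```python
class CachedLLM:
    """Toy stand-in for a long-context LLM with a reusable cache of
    preloaded runtime state. Illustrates the CAG flow: the knowledge
    base is loaded once, and queries trigger no retrieval calls."""

    def __init__(self):
        self.kv_cache = None          # precomputed runtime state (simulated)
        self.query_time_retrievals = 0  # retrievals performed per query (CAG: 0)

    def preload(self, documents):
        # Step 1 (one-time cost): place the whole, manageably sized
        # knowledge base in the context and cache the resulting state.
        self.kv_cache = " ".join(documents)

    def answer(self, query):
        # Step 2: answer directly from the cached context -- no
        # retrieval step, hence no retrieval latency or selection errors.
        assert self.kv_cache is not None, "preload() must be called first"
        hits = [s for s in self.kv_cache.split(". ") if query.lower() in s.lower()]
        return hits[0] if hits else "not in cache"


docs = ["CAG preloads documents into the context.",
        "RAG retrieves documents per query."]
llm = CachedLLM()
llm.preload(docs)          # amortized over all subsequent queries
print(llm.answer("preloads"))
```

In a real long-context LLM, the cached state would be the model's key-value (KV) attention cache over the preloaded documents, so each query only pays for encoding the question itself, not for retrieval or for re-encoding the knowledge base.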