Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) has emerged as a critical technique for enhancing large language model (LLM) capabilities. However, practitioners face significant challenges when making RAG deployment decisions. While existing research prioritizes algorithmic innovations, a systematic gap persists in understanding the fundamental engineering trade-offs that determine RAG success. We present the first comprehensive study of three universal RAG deployment decisions: whether to deploy RAG, how much information to retrieve, and how to integrate retrieved knowledge effectively. Through systematic experiments across three LLMs and six datasets spanning question answering (QA) and code generation tasks, we reveal three critical insights: (1) RAG deployment must be highly selective, with variable recall thresholds and failure modes affecting up to 12.6% of samples even with perfect documents; (2) optimal retrieval volume is task-dependent: QA tasks show universal patterns (5-10 documents optimal), while code generation requires scenario-specific optimization; and (3) knowledge integration effectiveness depends on task and model characteristics, with code generation benefiting significantly from prompting methods while question answering shows minimal improvement. These findings demonstrate that universal RAG strategies are inadequate: effective RAG systems require context-aware design decisions based on task characteristics and model capabilities. Our analysis provides evidence-based guidance for practitioners and establishes foundational insights for principled RAG deployment. Our code, data, and artifacts are publicly available at https://github.com/ShengmingZ/RAG_Benchmark_Code_QA.
