With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To answer this question, we conduct a controlled evaluation on question answering (QA) tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves the original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms the more intricate methods on multiple long-context QA benchmarks. We attribute this strength to three factors: maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models and benchmarked under matched token budgets, so that any added pipeline complexity is justified by clear performance gains as models continue to improve.
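To make the retrieve-then-read idea concrete, the following is a minimal sketch of how a DOS RAG style context could be assembled under a token budget: rank chunks by relevance, keep as many as fit, then restore the document's original order before reading. The abstract does not give the exact procedure, so everything here is an assumption for illustration: the `Chunk` type, the precomputed `score` field, the whitespace-based `count_tokens` proxy, and the choice to skip (rather than stop at) a chunk that does not fit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    position: int  # index of the chunk in the original document (hypothetical field)
    text: str
    score: float   # query-chunk similarity, assumed to be precomputed by a retriever


def count_tokens(text: str) -> int:
    # Crude whitespace proxy; a real setup would use the LM's own tokenizer.
    return len(text.split())


def dos_rag_context(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Select high-scoring chunks within the budget, then restore document order."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    selected: list[Chunk] = []
    used = 0
    for chunk in ranked:
        cost = count_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # assumption: skip chunks that do not fit, keep trying smaller ones
        selected.append(chunk)
        used += cost
    # The defining step: present the retrieved passages in their original order.
    selected.sort(key=lambda c: c.position)
    return selected


example_chunks = [
    Chunk(0, "a b c", 0.1),
    Chunk(1, "d e", 0.9),
    Chunk(2, "f g h", 0.5),
    Chunk(3, "i j", 0.8),
]
context = dos_rag_context(example_chunks, token_budget=7)
prompt_context = "\n".join(c.text for c in context)
```

Raising `token_budget` admits more chunks and so favors recall, while the final sort keeps the passages in the order a reader of the source document would encounter them.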