Retrieval-Augmented Generation (RAG) has become a core paradigm for grounding large language models in external knowledge. Despite extensive efforts to explore diverse retrieval strategies, existing studies predominantly focus on query-side complexity or isolated method improvements, and lack a systematic understanding of how RAG paradigms behave across different query-corpus contexts and effectiveness-efficiency trade-offs. In this work, we introduce RAGRouter-Bench, the first dataset and benchmark designed for adaptive RAG routing. RAGRouter-Bench revisits retrieval from a query-corpus compatibility perspective and standardizes five representative RAG paradigms for systematic evaluation across 7,727 queries and 21,460 documents spanning diverse domains. The benchmark incorporates three canonical query types together with fine-grained semantic and structural corpus metrics, as well as a unified evaluation of both generation quality and resource consumption. Experiments with DeepSeek-V3 and LLaMA-3.1-8B demonstrate that no single RAG paradigm is universally optimal, that paradigm applicability is strongly shaped by query-corpus interactions, and that more advanced mechanisms do not necessarily yield better effectiveness-efficiency trade-offs. These findings underscore the necessity of routing-aware evaluation and establish a foundation for adaptive, interpretable, and generalizable next-generation RAG systems.