Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to keep them robust. To address this, we present IRB, a framework for automatically generating benchmarks that evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on a \textit{factual scaffold} and an \textit{algorithmic scaffold}. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge to frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable than their non-reasoning counterparts, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.