Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information-seeking tasks such as report generation. Although the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the information coverage of the final generated response. In experiments on two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). We find strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both the topic and system levels. The relationship is strongest when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.
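As an illustration of the kind of analysis described above, the following minimal sketch correlates a retrieval metric with nugget coverage at two granularities. It is not the authors' implementation. The score table, the system and topic names, and the function names are hypothetical, and the numbers are toy values chosen for the example, not results. It assumes that the system-level analysis averages each system's scores over topics and compares system rankings with Kendall's tau, and that the topic-level analysis pools all (system, topic) pairs and computes Pearson's r; the actual study may define either level differently.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tied
    return score / len(pairs)


# Hypothetical toy data (NOT results from the paper):
# scores[system][topic] = (retrieval_coverage, nugget_coverage_of_response)
scores = {
    "sys_a": {"t1": (0.2, 0.1), "t2": (0.5, 0.4), "t3": (0.3, 0.3)},
    "sys_b": {"t1": (0.4, 0.3), "t2": (0.7, 0.5), "t3": (0.6, 0.5)},
    "sys_c": {"t1": (0.6, 0.5), "t2": (0.9, 0.8), "t3": (0.8, 0.6)},
}


def system_level(scores):
    """Average each system over topics, then rank-correlate the two averages."""
    retrieval = [mean(r for r, _ in topics.values()) for topics in scores.values()]
    generation = [mean(g for _, g in topics.values()) for topics in scores.values()]
    return kendall_tau(retrieval, generation)


def topic_level(scores):
    """Pool every (system, topic) pair and compute Pearson's r (an assumption)."""
    pooled = [pair for topics in scores.values() for pair in topics.values()]
    retrieval = [r for r, _ in pooled]
    generation = [g for _, g in pooled]
    return pearson(retrieval, generation)


if __name__ == "__main__":
    print("system-level Kendall tau:", system_level(scores))
    print("topic-level Pearson r:", topic_level(scores))
```

A rank correlation is used at the system level because the practical question there is whether the retrieval metric orders systems the same way nugget coverage does. The pooled Pearson coefficient at the topic level also reflects the magnitude of per-topic differences.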