Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. While the effectiveness of the retrieval module is typically evaluated with relevance-based ranking metrics, such metrics may be insufficient to reflect retrieval's impact on the final RAG output, especially in long-form generation scenarios. We argue that providing a comprehensive retrieval-augmented context is crucial for long-form RAG tasks such as report generation, and we propose metrics for assessing the context independently of generation. We introduce CRUX, a \textbf{C}ontrolled \textbf{R}etrieval-a\textbf{U}gmented conte\textbf{X}t evaluation framework designed to directly assess retrieval-augmented contexts. The framework uses human-written summaries to control the information scope of the knowledge, enabling us to measure how well the context covers the information essential for long-form generation. CRUX further adopts question-based evaluation to assess RAG's retrieval in a fine-grained manner. Empirical results show that CRUX offers a more reflective and diagnostic evaluation. Our findings also reveal substantial room for improvement in current retrieval methods, pointing to promising directions for advancing RAG's retrieval. Our data and code are publicly available to support and advance future research on retrieval.