Although Retrieval-Augmented Generation (RAG) has shown promising capability in leveraging external knowledge, comprehensively evaluating RAG systems remains challenging due to the modular nature of RAG, the evaluation of long-form responses, and the reliability of measurements. In this paper, we propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules. Meta-evaluation verifies that RAGChecker correlates significantly better with human judgments than other evaluation metrics. Using RAGChecker, we evaluate 8 RAG systems and conduct an in-depth analysis of their performance, revealing insightful patterns and trade-offs in the design choices of RAG architectures. The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.
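To make the notion of fine-grained, claim-level evaluation concrete, the following is a minimal illustrative sketch, not RAGChecker's actual implementation: it assumes responses and ground-truth answers have already been decomposed into claims and labeled by an entailment checker, and the names `ClaimLabels` and `claim_level_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Illustrative sketch only; RAGChecker's full metric suite (retriever and
# generator diagnostics) is richer than the overall precision/recall shown here.

@dataclass
class ClaimLabels:
    # True where a response claim is entailed by the ground-truth answer
    response_claims_correct: List[bool]
    # True where a ground-truth claim is entailed by the model response
    gt_claims_covered: List[bool]

def claim_level_metrics(labels: ClaimLabels) -> Dict[str, float]:
    """Compute claim-level precision, recall, and F1 for one example."""
    precision = (sum(labels.response_claims_correct) / len(labels.response_claims_correct)
                 if labels.response_claims_correct else 0.0)
    recall = (sum(labels.gt_claims_covered) / len(labels.gt_claims_covered)
              if labels.gt_claims_covered else 0.0)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 3 of 4 response claims are correct; 3 of 5 ground-truth claims are covered.
print(claim_level_metrics(ClaimLabels(
    response_claims_correct=[True, True, True, False],
    gt_claims_covered=[True, True, True, False, False],
)))
```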