Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications leverage its ability to enhance generative models through external information retrieval. Evaluating RAG systems, however, poses unique challenges because of their hybrid structure and their reliance on dynamic knowledge sources. To better understand these challenges, we present A Unified Evaluation Process of RAG (Auepora) and aim to provide a comprehensive overview of the evaluation and benchmarks of RAG systems. Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, within current RAG benchmarks, covering the possible pairs of outputs and ground truths. We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions for advancing RAG benchmarks.