Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging because constructing data is costly and suitable evaluation metrics are lacking. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of the clarity, safety, conformity, and richness of the generated samples. Furthermore, LLM-based scoring of the proposed metrics is highly consistent with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications.
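The abstract names the three metrics but does not define them. As a rough illustration only, the sketch below assumes that each reference answer is broken into key points and that a judge (an LLM or a human) has already labeled each key point as covered by the response, contradicted by it, or left unaddressed. The label names, the function `score_answer`, and the per-key-point formulation are assumptions made for this example, not definitions taken from the text above.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label set: one label per key point of the reference answer.
LABELS = ("covered", "contradicted", "unaddressed")


def score_answer(labels: List[str]) -> Dict[str, float]:
    """Turn per-key-point judgments into three fractions that sum to 1.

    Assumed (not source-confirmed) definitions:
      completeness  = share of key points the response covers
      hallucination = share of key points the response contradicts
      irrelevance   = share of key points the response leaves unaddressed
    """
    if not labels:
        raise ValueError("at least one key point is required")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(labels)
    n = len(labels)
    return {
        "completeness": counts["covered"] / n,
        "hallucination": counts["contradicted"] / n,
        "irrelevance": counts["unaddressed"] / n,
    }


# Example: four key points, two covered, one contradicted, one unaddressed.
scores = score_answer(["covered", "covered", "contradicted", "unaddressed"])
print(scores)
```

Under these assumptions, a response is scored per question and the three values can be averaged over a test set. The judging step itself, which the abstract says can be delegated to an LLM, is deliberately left outside the sketch.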