Retrieval-Augmented Generation (RAG) systems have demonstrated their advantages in alleviating the hallucination of Large Language Models (LLMs). Existing RAG benchmarks mainly focus on evaluating whether LLMs can correctly answer general knowledge questions. However, they are unable to evaluate the effectiveness of RAG systems in dealing with data from different vertical domains. This paper introduces RAGEval, a framework for automatically generating evaluation datasets to assess the knowledge usage ability of different LLMs in different scenarios. Specifically, RAGEval summarizes a schema from seed documents, applies configurations to generate diverse documents, and constructs question-answering pairs according to both the articles and the configurations. We propose three novel metrics, Completeness, Hallucination, and Irrelevance, to carefully evaluate the responses generated by LLMs. By benchmarking RAG models in vertical domains, RAGEval can better evaluate the knowledge usage ability of LLMs, avoiding the confusion in existing QA datasets about the source of the knowledge used to answer a question: whether it comes from parameterized memory or from retrieval. The code and dataset will be released.