RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high cost of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of the clarity, safety, conformity, and richness of generated samples. Furthermore, using LLMs to score the proposed metrics yields a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
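
As a rough illustration of how three metrics of this kind might be computed, the sketch below assumes that each reference answer has been split into key points and that a judge (an LLM in practice) has already labeled every key point as covered, contradicted, or not addressed by the response. The label names, the `score_response` function, and the fraction-based formulas are assumptions made for illustration only; the abstract does not specify how the metrics are defined.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical per-key-point labels (assumed, not taken from the abstract).
COVERED = "covered"            # response states the key point correctly
CONTRADICTED = "contradicted"  # response conflicts with the key point
MISSING = "missing"            # response does not address the key point


def score_response(labels: List[str]) -> Dict[str, float]:
    """Turn per-key-point labels into three fractions that sum to 1."""
    if not labels:
        raise ValueError("at least one key point is required")
    unknown = set(labels) - {COVERED, CONTRADICTED, MISSING}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {
        # Assumed definition: share of key points the response covers.
        "completeness": counts[COVERED] / total,
        # Assumed definition: share of key points the response contradicts.
        "hallucination": counts[CONTRADICTED] / total,
        # Assumed definition: share of key points left unaddressed.
        "irrelevance": counts[MISSING] / total,
    }


# Four key points: two covered, one contradicted, one not addressed.
scores = score_response([COVERED, COVERED, CONTRADICTED, MISSING])
print(scores)
```

Under these assumed definitions the three scores partition the key points, so a higher Completeness necessarily lowers the sum of the other two.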

