Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese

The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Although various FCE methods have been proposed, they are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, since they may fail to detect the error types that other LLMs generate. To fill this gap, we propose \emph{Face4RAG}, the first comprehensive FCE benchmark for RAG that is independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology of factual inconsistency errors and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or on real-world error distributions. On the proposed benchmark, we find that existing FCE methods fail to detect logical fallacies, that is, mismatches in logical structure between the answer and the retrieved reference. To address this issue, we further propose a new method, \emph{L-Face4RAG}, with two novel designs: logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show that L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably including tasks beyond the RAG setting that originally motivated it. Both the benchmark and our proposed method are publicly available.\footnote{\url{https://huggingface.co/datasets/yq27/Face4RAG}\label{link_face4rag}}

