The prevalence of factual inconsistency errors in conventional Retrieval-Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Although various FCE methods have been proposed, they are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, since they may fail to detect the error types produced by other LLMs. To fill this gap, we propose \emph{Face4RAG}, the first comprehensive FCE benchmark for RAG that is independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology of factual inconsistency errors and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or on real-world error distributions. On the proposed benchmark, we find that existing FCE methods fail to detect logical fallacies, that is, mismatches in logical structure between the answer and the retrieved reference. To address this issue, we further propose a new method, \emph{L-Face4RAG}, with two novel designs: logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show that L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably including tasks beyond the RAG setting that originally motivated it. Both the benchmark and our proposed method are publicly available.\footnote{\url{https://huggingface.co/datasets/yq27/Face4RAG}\label{link_face4rag}}