Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise both from inaccuracies in the model's internal knowledge and from errors in the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations whenever they are not explicitly supported by the retrieved context. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ detects factual errors in RAG-generated responses more accurately than existing approaches.