FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

Large language models are known to produce hallucinations (factually incorrect or fabricated information), which pose significant challenges for many natural language processing applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of a generated response as a whole. However, a single response often mixes accurate, inaccurate, and non-verifiable facts, so assigning it one factual label is overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate various baseline methods on it. Experimental results demonstrate that methods incorporating Chain-of-Thought reasoning can enhance performance in dialogue fact verification. Even so, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.74, indicating that the benchmark remains challenging for future research. We release our dataset and code at https://github.com/XiangyanChen/FineDialFact.
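To make the fine-grained setting concrete, the following is a minimal sketch of scoring verification at the level of atomic facts rather than whole responses. It is not the released FineDialFact code: the label names (`supported`, `refuted`, `not_verifiable`), the `AtomicFact` class, the toy facts, and the choice of macro-averaged F1 are all assumptions made for illustration, since the abstract only says that facts may be accurate, inaccurate, or non-verifiable and that an F1-score is reported.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label names; the abstract only distinguishes accurate,
# inaccurate, and non-verifiable facts.
LABELS = ("supported", "refuted", "not_verifiable")


@dataclass
class AtomicFact:
    """One atomic fact extracted from a dialogue response (illustrative)."""
    text: str
    gold: str       # reference label
    predicted: str  # label assigned by a verifier


def macro_f1(facts, labels=LABELS):
    """Macro-averaged F1 over per-fact labels (an assumed metric variant)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for fact in facts:
        if fact.predicted == fact.gold:
            tp[fact.gold] += 1
        else:
            fp[fact.predicted] += 1
            fn[fact.gold] += 1
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A label with no gold or predicted instances contributes 0.0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one response split into three atomic facts.
facts = [
    AtomicFact("The Eiffel Tower is in Paris.", "supported", "supported"),
    AtomicFact("The Eiffel Tower was built in 1850.", "refuted", "supported"),
    AtomicFact("It is the speaker's favourite landmark.",
               "not_verifiable", "not_verifiable"),
]

score = macro_f1(facts)
print(f"macro-F1 over atomic facts: {score:.3f}")  # prints 0.556
```

A single response-level label would have to call this toy response either correct or incorrect, whereas the per-fact labels record that one claim holds, one is wrong, and one cannot be checked.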

