Large language models are known to produce hallucinations (factually incorrect or fabricated information), which pose significant challenges for many natural language processing applications, such as dialogue systems. As a result, hallucination detection has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, a single response often contains a mix of accurate, inaccurate, and non-verifiable facts, so assigning it a single factual label is overly coarse-grained. In this paper, we introduce FineDialFact, a benchmark for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate various baseline methods on it. Experimental results demonstrate that methods incorporating Chain-of-Thought reasoning can enhance performance in dialogue fact verification. Even so, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.74, indicating that the benchmark remains challenging for future research. We release our dataset and code at https://github.com/XiangyanChen/FineDialFact.