Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.