Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.