HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

May 22, 2023

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen

From arXiv. Work in progress.

Large language models (LLMs), such as ChatGPT, are prone to generating hallucinations, i.e., content that conflicts with the source or cannot be verified against factual knowledge. To understand what types of content LLMs are apt to hallucinate, and to what extent, we introduce the Hallucination Evaluation for Large Language Models (HaluEval) benchmark, a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, i.e., sampling-then-filtering. In addition, we hire human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content on specific topics by fabricating unverifiable information (i.e., in about 11.4% of user queries). Moreover, existing LLMs face great challenges in recognizing hallucinations in text. However, our experiments also show that hallucination recognition can be improved by providing external knowledge or adding reasoning steps. Our benchmark can be accessed at https://github.com/RUCAIBox/HaluEval.
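To make the recognition task in the abstract concrete, here is a minimal sketch of how a model's ability to recognize hallucination could be scored as plain accuracy over labeled samples. It is an illustration under stated assumptions, not the authors' evaluation code: the `Sample` layout, the `recognition_accuracy` function, the `keyword_judge` stand-in, and the toy data are all hypothetical and are not taken from the HaluEval repository.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical sample layout: (text, is_hallucinated). Not the benchmark's real schema.
Sample = Tuple[str, bool]


def recognition_accuracy(samples: Iterable[Sample],
                         judge: Callable[[str], bool]) -> float:
    """Return the fraction of samples whose label the judge predicts correctly."""
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to evaluate")
    correct = sum(1 for text, label in samples if judge(text) == label)
    return correct / len(samples)


def keyword_judge(text: str) -> bool:
    """Toy stand-in for an LLM judge: flags text containing the word 'unverified'."""
    return "unverified" in text.lower()


# Invented toy data, for illustration only.
toy_samples = [
    ("The Eiffel Tower is in Paris.", False),
    ("An unverified source says the tower is 900 m tall.", True),
    ("The tower was moved to Lyon in 1999.", True),
    ("Paris is the capital of France.", False),
]

print(recognition_accuracy(toy_samples, keyword_judge))  # 0.75
```

In practice the `judge` callable would wrap a call to the LLM under evaluation. The abstract's finding that external knowledge or added reasoning steps help would correspond to changing what that callable puts in its prompt.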
