HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs

Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs), limiting their widespread acceptance beyond chatbot applications. Despite ongoing mitigation efforts, they remain prevalent. Detecting hallucinations is itself a formidable task, frequently requiring manual labeling or constrained evaluations. This paper introduces an automated, scalable framework that combines benchmarking of LLMs' hallucination tendencies with efficient hallucination detection. We leverage LLMs to generate challenging tasks related to hypothetical phenomena, and subsequently employ LLMs as agents for hallucination detection. The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. We introduce the publicly available HypoTermQA Benchmarking Dataset, on which state-of-the-art models' performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. The proposed framework provides opportunities to test and improve LLMs. Additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.
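
As a rough illustration of what an evaluator "error rate in hallucination prediction" could mean, the sketch below compares an evaluator agent's labels against reference labels and reports the fraction of disagreements. This is only an assumed reading of the metric, not the paper's actual evaluation code; the function name `evaluator_error_rate` and the toy labels are hypothetical.

```python
from typing import Sequence


def evaluator_error_rate(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of answers where the evaluator's hallucination label
    disagrees with the reference (e.g. human) label.

    Assumption: the error rate is plain disagreement over all labeled answers.
    """
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not predicted:
        raise ValueError("at least one labeled answer is required")
    disagreements = sum(p != r for p, r in zip(predicted, reference))
    return disagreements / len(predicted)


# Toy labels, invented for illustration only (True = "hallucinated").
evaluator_labels = [True, True, False, True, False]
human_labels = [True, False, False, True, False]
rate = evaluator_error_rate(evaluator_labels, human_labels)
print(f"{rate:.0%}")  # one disagreement out of five answers: prints 20%
```

Under this reading, a 6% error rate would correspond to the evaluator disagreeing with the reference label on 6 out of every 100 answers.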

