We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these units is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucination than previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels, and it can be extended to other specialized domains. We release the dataset and code publicly.
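The scoring idea described above can be illustrated with a minimal sketch. This is not the paper's actual pipeline (which relies on an LLM to decompose responses into atomic facts and verify each one against biomedical sources); it only shows, under that assumption, how per-unit factuality labels would be aggregated into a DAHL Score. The function name and input format are hypothetical.

```python
# Hypothetical sketch of DAHL Score aggregation. Assumes an upstream
# verifier has already labeled every atomic unit of every response as
# factual (True) or hallucinated (False).

def dahl_score(unit_labels):
    """Average response-level factual accuracy.

    unit_labels: list of responses, each a list of booleans, one per
    atomic information unit extracted from that response.
    """
    # Accuracy of each response = fraction of its atomic units that are factual.
    per_response = [sum(units) / len(units) for units in unit_labels if units]
    # DAHL Score = mean accuracy across responses.
    return sum(per_response) / len(per_response)

# Example: one response with 2 of 3 units correct, one with 2 of 2 correct.
labels = [[True, True, False], [True, True]]
print(round(dahl_score(labels), 3))  # → 0.833
```

Averaging at the response level first (rather than pooling all units) keeps long responses with many atomic units from dominating the score.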