We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these units is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucination than previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels, and it can be extended to other specialized domains. We release the dataset and code publicly.
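The scoring idea described above can be illustrated with a minimal sketch. This is not the paper's actual pipeline (which relies on an LLM to decompose responses into atomic facts and verify each one against biomedical sources); it only shows, under that assumption, how per-unit factuality labels would be aggregated into a DAHL Score. The function name and input format are hypothetical.

```python
# Hypothetical sketch of DAHL Score aggregation. Assumes an upstream
# verifier has already labeled every atomic unit of every response as
# factual (True) or hallucinated (False).

def dahl_score(unit_labels):
    """Average response-level factual accuracy.

    unit_labels: list of responses, each a list of booleans, one per
    atomic information unit extracted from that response.
    """
    # Accuracy of each response = fraction of its atomic units that are factual.
    per_response = [sum(units) / len(units) for units in unit_labels if units]
    # DAHL Score = mean accuracy across responses.
    return sum(per_response) / len(per_response)

# Example: one response with 2 of 3 units correct, one with 2 of 2 correct.
labels = [[True, True, False], [True, True]]
print(round(dahl_score(labels), 3))  # → 0.833
```

Averaging at the response level first (rather than pooling all units) keeps long responses with many atomic units from dominating the score.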