This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests: reasoning-based and memory-based hallucination tests, designed to assess LLMs' problem-solving and information retrieval abilities, respectively. Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2, MPT, and Falcon, revealing significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility. Through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. Our benchmark can be found at medhalt.github.io.