This research paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the medical domain. Hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. It comprises two categories of tests, reasoning-based and memory-based hallucination tests, designed to assess LLMs' problem-solving and information retrieval abilities. Our study evaluated leading LLMs, including Text Davinci, GPT-3.5, LlaMa-2, MPT, and Falcon, revealing significant differences in their performance. The paper provides detailed insights into the dataset, promoting transparency and reproducibility. Through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. Our benchmark can be found at medhalt.github.io.