Mitigating hallucinations in healthcare LLMs with granular fact-checking and domain-specific adaptation

Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam

From arXiv; published in Expert Systems with Applications

In healthcare, output generated by a Large Language Model (LLM) must be reliable and accurate, particularly where it informs decision-making and patient safety. In practice, LLM outputs in such critical settings are often unreliable because the models can hallucinate. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. The summarization model is fine-tuned with Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and paired with the fact-checking module, which validates facts against electronic health records (EHRs) at a granular level, using numerical tests for correctness and logical checks based on discrete logic in natural language processing (NLP). The LLM was trained on the full MIMIC-III dataset. To evaluate the fact-checking module, we sampled 104 summaries and decomposed them into 3,786 propositions, which served as the facts to be checked. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. For summary quality, the LLM-generated summaries achieve a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120.
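As a quick consistency check on the reported fact-checking metrics, the F1-score can be recomputed from the stated precision and recall as their harmonic mean. This is a minimal sketch, not the authors' evaluation code; the function name `f1_score` is a hypothetical helper introduced here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1-score)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract for the fact-checking module.
reported_precision = 0.8904
reported_recall = 0.8234

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 4))  # 0.8556, matching the reported F1-score
```

The recomputed value agrees with the reported F1-score to four decimal places, so the three figures are mutually consistent.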

