In healthcare, any output generated by a Large Language Model (LLM) must be reliable and accurate, particularly where it informs decision-making and patient safety. In such critical settings, however, LLM outputs are often unreliable because the models can hallucinate. To address this issue, we propose a fact-checking module that operates independently of any LLM, together with a domain-specific summarization model designed to minimize hallucination rates. The summarization model is fine-tuned with Low-Rank Adaptation (LoRA) on the full MIMIC-III dataset. The fact-checking module validates facts against electronic health records (EHRs) at a granular level, combining numerical tests for correctness with logical checks based on discrete logic in natural language processing (NLP). To evaluate the fact-checking module, we sampled 104 summaries and decomposed them into 3,786 propositions, which served as the facts to be checked. The module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. For summary quality, the LLM-generated summaries achieve a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120.
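To make the idea of proposition-level checking concrete, the following is a minimal sketch of how a single proposition might be tested against an EHR record without calling any LLM. It is not the module described above: the `Proposition` structure, the `check` function, the field names, the three verdict labels, and the 1% relative tolerance are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Proposition:
    """Hypothetical atomic claim extracted from a summary (illustrative only)."""
    field: str                      # EHR field the claim refers to, e.g. "heart_rate"
    value: Optional[float] = None   # numeric claim, or None for a presence/absence claim
    negated: bool = False           # True for claims such as "no history of diabetes"


def check(prop: Proposition, ehr: dict, rel_tol: float = 0.01) -> str:
    """Return 'supported', 'contradicted', or 'unverifiable' for one proposition.

    Assumption: `ehr` is a flat dict mapping field names to numbers or booleans.
    """
    recorded = ehr.get(prop.field)

    if prop.value is not None:
        # Numerical test: compare the claimed value with the recorded one.
        is_number = isinstance(recorded, (int, float)) and not isinstance(recorded, bool)
        if not is_number:
            return "unverifiable"
        within = abs(recorded - prop.value) <= rel_tol * max(abs(recorded), 1.0)
        return "supported" if within else "contradicted"

    # Logical test: does the record agree with the (possibly negated) claim?
    if recorded is None:
        return "unverifiable"
    present = bool(recorded)
    return "supported" if present != prop.negated else "contradicted"


# Toy record; the field names and values are invented for this example.
ehr_record = {"heart_rate": 88, "diabetes": False}

verdicts = [
    check(Proposition("heart_rate", 88.0), ehr_record),
    check(Proposition("heart_rate", 120.0), ehr_record),
    check(Proposition("diabetes", negated=True), ehr_record),
    check(Proposition("creatinine", 1.2), ehr_record),
]
print(verdicts)
```

Returning a separate "unverifiable" verdict, rather than treating a missing field as a contradiction, is a design choice of this sketch; it keeps gaps in the record distinct from claims the record actually refutes.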