Large language models (LLMs) have demonstrated powerful text generation capabilities, bringing unprecedented innovation to the healthcare field. While LLMs hold immense promise for healthcare applications, applying them to real clinical scenarios presents significant challenges, as these models may generate content that deviates from established medical facts and may even exhibit biases. In this research, we develop an augmented LLM framework based on the Unified Medical Language System (UMLS), aiming to better serve the healthcare community. We employ LLaMa2-13b-chat and ChatGPT-3.5 as our benchmark models and conduct automatic evaluations using ROUGE and BERTScore on 104 questions from the LiveQA test set. Additionally, we establish criteria for physician evaluation based on four dimensions: Factuality, Completeness, Readability, and Relevancy. The physician evaluation uses ChatGPT-3.5 on 20 questions from the LiveQA test set. Multiple resident physicians conducted blind reviews of the generated content, and the results indicate that the framework effectively enhances the factuality, completeness, and relevance of generated content. Our research demonstrates the effectiveness of UMLS-augmented LLMs and highlights the potential application value of LLMs in medical question-answering.
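The ROUGE score used in the automatic evaluation measures n-gram overlap between a generated answer and a reference answer. As an illustration only, here is a minimal sketch of ROUGE-1 F1, assuming lowercased whitespace tokenization. The function name `rouge1_f1` is hypothetical, and this is not the evaluation code used in this work; a standard ROUGE package would normally add stemming and further variants such as ROUGE-2 and ROUGE-L.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Illustrative ROUGE-1 F1: unigram overlap between two texts.

    Assumption: tokens are produced by lowercasing and splitting on
    whitespace, which is simpler than what standard ROUGE tools do.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a generated answer that repeats only part of the reference keeps perfect precision but loses recall, which lowers the F1 score.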