Background: Large Language Models (LLMs), enhanced with Clinical Practice Guidelines (CPGs), can significantly improve Clinical Decision Support (CDS). However, methods for incorporating CPGs into LLMs are not well studied.

Methods: We developed three distinct methods for incorporating CPGs into LLMs: Binary Decision Tree (BDT), Program-Aided Graph Construction (PAGC), and Chain-of-Thought-Few-Shot Prompting (CoT-FSP). To evaluate the proposed methods, we created a set of synthetic patient descriptions and conducted both automatic and human evaluation of the responses generated by four LLMs: GPT-4, GPT-3.5 Turbo, LLaMA, and PaLM 2. Zero-Shot Prompting (ZSP) served as the baseline method. We used CDS for COVID-19 outpatient treatment as the case study.

Results: All four LLMs performed better when enhanced with CPGs than with the baseline ZSP. BDT outperformed both CoT-FSP and PAGC in automatic evaluation. All of the proposed methods demonstrated high performance in human evaluation.

Conclusion: LLMs enhanced with CPGs outperformed plain LLMs with ZSP in providing accurate recommendations for COVID-19 outpatient treatment, which highlights the potential for broader applications beyond this case study.