Recent studies have demonstrated promising performance of ChatGPT and GPT-4 on several medical domain tasks. However, none has assessed their performance on a large-scale, real-world electronic health record database, nor evaluated their utility in providing clinical diagnostic assistance for patients across a full range of disease presentations. We performed two analyses using ChatGPT and GPT-4: one to identify patients with specific medical diagnoses in a large real-world electronic health record database, and the other to provide diagnostic assistance to healthcare workers in the prospective evaluation of hypothetical patients. Our results show that, across disease classification tasks, GPT-4 with chain-of-thought and few-shot prompting can achieve F1 scores as high as 96%. For patient assessment, GPT-4 reached an accurate diagnosis in three out of four cases. However, its responses included factually incorrect statements, overlooked crucial medical findings, and recommended unnecessary investigations and overtreatment. These issues, coupled with privacy concerns, make these models currently inadequate for real-world clinical use. Nevertheless, the limited data and time needed for prompt engineering, compared with configuring conventional machine learning workflows, highlight their potential for scalability across healthcare applications.
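For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1 score is computed for a binary disease classification task. The function name `f1_score` and the counts used are hypothetical, chosen only to illustrate the arithmetic; they are not the study's data or code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score: the harmonic mean of precision and recall.

    tp, fp, fn are the counts of true positives, false positives,
    and false negatives for one diagnosis label.
    """
    if tp == 0:
        # With no true positives, precision and recall are both zero
        # (or undefined), so F1 is reported as 0.
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(f1_score(tp=96, fp=4, fn=4), 2))  # prints 0.96
```

Because F1 is a harmonic mean, it stays high only when precision and recall are both high, which is why it is commonly reported for diagnosis labels that are rare in health record data.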