The successes of foundation models such as ChatGPT and AlphaFold have spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. We review over 80 foundation models trained on non-imaging EMR data (i.e., clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly scoped clinical datasets (e.g., MIMIC-III) or broad, public biomedical corpora (e.g., PubMed) and are evaluated on tasks that do not provide meaningful insights into their usefulness to health systems. In light of these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded in metrics that matter in healthcare.