While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML, because validating their performance benefits requires access to shared pretrained models. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, containing de-identified structured data from the electronic health records (EHRs) of 6,712 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are among the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g., GatorTron, ClinicalBERT) work only with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. The code to reproduce our results, as well as the model and dataset (via a research data use agreement), is available at our GitHub repository: https://github.com/som-shahlab/ehrshot-benchmark