While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML, because validating their performance benefits requires access to shared pretrained models. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are among the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g., GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. Our model and dataset are available via a research data use agreement from the Stanford AIMI Center. Code to reproduce our results is available at our GitHub repo: https://github.com/som-shahlab/ehrshot-benchmark