In this work we introduce Labrador, a pre-trained Transformer model for laboratory data. Labrador and BERT were pre-trained on a corpus of 100 million lab test results from electronic health records (EHRs) and evaluated on various downstream outcome prediction tasks. Both models demonstrate mastery of the pre-training task, but neither consistently outperforms XGBoost on downstream supervised tasks. Our ablation studies reveal that transfer learning shows limited effectiveness for BERT and achieves only marginal success for Labrador. We explore the reasons for the failure of transfer learning and suggest that, among other factors, the data-generating process underlying each patient cannot be characterized sufficiently using labs alone. We encourage future work to focus on joint modeling of multiple EHR data categories and to include tree-based baselines in their evaluations.
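To make the final recommendation concrete, the following is a minimal sketch (not from the paper) of the kind of tree-based baseline the abstract advocates: gradient-boosted trees fit directly on tabular lab features for a binary outcome prediction task. The feature matrix, outcome labels, and all hyperparameters here are hypothetical placeholders, not the paper's actual evaluation setup.

```python
# Hypothetical XGBoost baseline on tabular lab features (illustrative only).
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder data: one row per patient, one column per lab test value
# (e.g., creatinine, hemoglobin, ...); y is a synthetic binary outcome.
n_patients, n_labs = 1000, 20
X = rng.normal(size=(n_patients, n_labs))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Gradient-boosted trees operate on raw tabular labs without any pre-training,
# which is what makes them a strong, cheap baseline for this setting.
model = XGBClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric="logloss"
)
model.fit(X_train, y_train)

print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```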