Electronic Health Records (EHRs) enable deep learning for clinical prediction, but the optimal method for representing patient data remains unclear due to inconsistent evaluation practices. We present the first systematic benchmark comparing EHR representation methods, including multivariate time-series, event streams, and textual event streams for LLMs. The benchmark standardises data curation and evaluation across two distinct clinical settings: the MIMIC-IV dataset for ICU tasks (mortality, phenotyping) and the EHRSHOT dataset for longitudinal care (30-day readmission, 1-year pancreatic cancer prediction). For each paradigm, we evaluate appropriate modelling families (Transformers, MLPs, LSTMs, and RETAIN for time-series; CLMBR and count-based models for event streams; 8-20B-parameter LLMs for textual streams) and analyse the impact of feature pruning based on data missingness. Our experiments reveal that event stream models consistently deliver the strongest performance. Pre-trained models such as CLMBR are highly sample-efficient in few-shot settings, though simpler count-based models can be competitive given sufficient data. Furthermore, we find that feature selection strategies must be adapted to the clinical setting: pruning sparse features improves ICU predictions, while retaining them is critical for longitudinal tasks. Our results, enabled by a unified and reproducible pipeline, provide practical guidance for selecting EHR representations based on the clinical context and data regime.
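To make the missingness-based pruning concrete, the following is a minimal sketch, not the paper's actual pipeline: features whose fraction of missing values exceeds a threshold are dropped before modelling. The function name, record layout, and threshold are illustrative assumptions.

```python
# Hypothetical sketch of missingness-based feature pruning.
# records: list of per-patient dicts mapping feature name -> value,
# where None (or an absent key) counts as missing.

def prune_sparse_features(records, max_missing_frac=0.8):
    """Return the set of feature names whose missing-value
    fraction is at most max_missing_frac."""
    n = len(records)
    features = {f for r in records for f in r}
    keep = set()
    for f in features:
        missing = sum(1 for r in records if r.get(f) is None)
        if missing / n <= max_missing_frac:
            keep.add(f)
    return keep

# Toy example: "lactate" is missing in 3 of 4 records and is pruned
# at a 0.5 threshold; "heart_rate" (1 of 4 missing) is kept.
records = [
    {"heart_rate": 80, "lactate": None},
    {"heart_rate": 72, "lactate": 2.1},
    {"heart_rate": None, "lactate": None},
    {"heart_rate": 90},  # "lactate" absent entirely -> missing
]
kept = prune_sparse_features(records, max_missing_frac=0.5)
```

Whether such pruning helps is setting-dependent, per the abstract: it improved ICU predictions but hurt longitudinal tasks, where sparse features carry signal.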