While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms widely used simulation-based next-token approaches. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.
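The evaluation pitfall described above can be illustrated with a minimal sketch. This is not RAVEN's evaluation code: the helper names (`split_targets`, `recall`), the toy patient, and the made-up code labels are all assumptions chosen for illustration. It shows how a trivial baseline that only repeats a patient's past codes can score well when every next-visit event counts, yet score zero once recurrences are separated from new onsets.

```python
from typing import List, Set, Tuple


def split_targets(history: List[Set[str]], next_visit: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split next-visit codes into recurrences (seen in history) and new onsets."""
    seen = set().union(*history) if history else set()
    return next_visit & seen, next_visit - seen


def recall(predicted: Set[str], targets: Set[str]) -> float:
    """Fraction of target codes that appear in the predicted set."""
    if not targets:
        return float("nan")  # undefined when there is nothing to recover
    return len(predicted & targets) / len(targets)


# Toy patient (hypothetical labels, not real clinical codes).
history = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "statin"},
]
next_visit = {"diabetes", "hypertension", "statin", "ckd"}

# Trivial baseline: predict that every previously seen code recurs.
copy_baseline = set().union(*history)

recurrent, new_onset = split_targets(history, next_visit)
recall_all = recall(copy_baseline, next_visit)  # counts recurrences and new onsets together
recall_new = recall(copy_baseline, new_onset)   # counts new onsets only

print(f"recall over all next-visit events: {recall_all:.2f}")
print(f"recall over new onsets only:       {recall_new:.2f}")
```

In this toy case the copy baseline recovers three of the four next-visit codes but none of the new onsets, which is the kind of gap that an aggregate metric would hide.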