Generative models trained with self-supervision on tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. Predictions are typically obtained by Monte Carlo simulation of future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators, the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage the next-token probability distributions discarded by standard Monte Carlo. We prove that both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome's lower "spontaneity," a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.
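The idea of reusing next-token probabilities that standard Monte Carlo discards can be illustrated on a toy problem. The sketch below is not the paper's implementation: the two-state "model" `NEXT`, its probabilities, and the helper names `rollout`, `estimate`, and `exact_risk` are all hypothetical. It assumes, going only by SCOPE's name, an estimator that sums the outcome token's conditional probability at every step of a sampled trajectory, and compares it with the usual hit-or-miss Monte Carlo indicator.

```python
import random

# Hypothetical toy "model": next-token probabilities depend only on the
# last token. All numbers are made up for illustration.
NEXT = {
    "stable":   {"stable": 0.70, "critical": 0.10, "death": 0.01, "discharge": 0.19},
    "critical": {"stable": 0.30, "critical": 0.55, "death": 0.10, "discharge": 0.05},
}
OUTCOME = "death"
TERMINAL = {"death", "discharge"}


def rollout(rng, start="stable", max_len=200):
    """Sample one trajectory and return two per-trajectory quantities:
    the 0/1 outcome indicator (standard Monte Carlo) and the sum of the
    outcome token's conditional probability over all steps taken."""
    token, hit, prob_sum = start, 0.0, 0.0
    for _ in range(max_len):
        probs = NEXT[token]
        prob_sum += probs[OUTCOME]  # probability mass plain Monte Carlo ignores
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token in TERMINAL:
            hit = float(token == OUTCOME)
            break
    return hit, prob_sum


def estimate(n_samples, seed=0):
    """Average both quantities over n_samples simulated trajectories."""
    rng = random.Random(seed)
    draws = [rollout(rng) for _ in range(n_samples)]
    mc = sum(h for h, _ in draws) / n_samples
    cond = sum(s for _, s in draws) / n_samples
    return mc, cond


def exact_risk(start="stable", sweeps=500):
    """Ground-truth outcome probability for the toy chain, obtained by
    fixed-point iteration of risk[s] = p(outcome|s) + sum_t p(t|s) * risk[t]."""
    risk = {s: 0.0 for s in NEXT}
    for _ in range(sweeps):
        risk = {s: p[OUTCOME] + sum(p[t] * risk[t] for t in NEXT)
                for s, p in NEXT.items()}
    return risk[start]
```

Both averages target the same quantity because, at each step, the expected value of the outcome indicator given the history equals the outcome token's conditional probability. Calling `estimate(10)` and comparing against `exact_risk()` shows how each estimator behaves at a small sample budget on this toy chain; nothing here carries over to the reported MIMIC-IV results.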