Existing survival models either do not scale to high-dimensional, multi-modal data or are difficult to interpret. In this study, we present MixEHR-SurG, a supervised topic model that simultaneously integrates heterogeneous EHR data and models survival hazard. Our contributions are threefold: (1) integrating EHR topic inference with the Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts, such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG on a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data, consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes, and the MIMIC-III data, consisting of 1,458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC of 0.89 on the simulated dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after their first heart failure hospitalization, and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge. Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG yields not only competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available on GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
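To make contribution (1) concrete, the following is a minimal sketch of a Cox partial log-likelihood in which each patient's covariates are their topic proportions. It is an illustration only, not the MixEHR-SurG implementation: the function name `cox_partial_log_likelihood`, its arguments, and the toy data are hypothetical, ties are handled in the simple Breslow style, and the topic proportions are treated as fixed inputs rather than being inferred jointly with the survival coefficients as the abstract describes.

```python
import math


def cox_partial_log_likelihood(topic_props, times, events, coefs):
    """Breslow-style Cox partial log-likelihood.

    topic_props: one list of topic proportions per patient (hypothetical input)
    times:       observed follow-up time per patient
    events:      1 if death was observed, 0 if the patient was censored
    coefs:       one survival coefficient per topic
    """
    # Linear risk score per patient: dot product of coefficients and topic proportions.
    scores = [sum(w * p for w, p in zip(coefs, theta)) for theta in topic_props]

    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only through the risk sets
        # Risk set: every patient still under observation at time t_i.
        risk = [scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(risk)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in risk))
        total += scores[i] - log_denominator
    return total


# Toy example (made-up data): three patients, two topics.
theta = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]

ll_null = cox_partial_log_likelihood(theta, times, events, [0.0, 0.0])
ll_risky = cox_partial_log_likelihood(theta, times, events, [2.0, 0.0])
```

In this toy data the patients with the largest share of the first topic die earliest, so giving that topic a positive coefficient raises the partial log-likelihood above the all-zero baseline. A topic with a large positive coefficient would then be read as a phenotype associated with higher mortality risk, which is the kind of interpretation the abstract refers to.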