Objective: To improve survival analysis using EHR data, we aim to develop a supervised topic model, MixEHR-SurG, that simultaneously integrates heterogeneous EHR data and models survival hazard. Materials and Methods: Our technical contributions are threefold: (1) integrating EHR topic inference with the Cox proportional hazards likelihood; (2) inferring patient-specific topic hyperparameters using PheCode concepts, such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. The result is a highly interpretable guided survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data, consisting of 8,211 subjects with 75,187 outpatient claims covering 1,767 unique ICD codes, and the MIMIC-III data, consisting of 1,458 subjects with multi-modal EHR records. Results: Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC of 0.89 on the simulated dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after their first heart failure hospitalization, and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge. Conclusion: The integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG led not only to competitive mortality prediction but also to meaningful phenotype topics for systematic survival analysis. The software is available on GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
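To make contribution (1) concrete, the following is a minimal sketch of a Cox partial log-likelihood in which each patient's covariates are their topic proportions. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the function name and the arguments `theta`, `w`, `times`, and `events` are hypothetical, event times are assumed to be untied, and no numerical stabilization is applied.

```python
import math


def cox_partial_log_likelihood(theta, w, times, events):
    """Cox partial log-likelihood with topic proportions as covariates.

    theta  : list of per-patient topic proportion vectors (hypothetical input)
    w      : list of per-topic Cox coefficients (hypothetical name)
    times  : observed time for each patient (event or censoring time)
    events : 1 if death was observed, 0 if the patient was censored
    Assumes no tied event times.
    """
    # Linear predictor for each patient: dot product of topic mixture and coefficients.
    risk = [sum(t_k * w_k for t_k, w_k in zip(row, w)) for row in theta]

    # Visit patients from longest to shortest observed time, so the running sum
    # always covers exactly the risk set (patients still under observation).
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)

    total = 0.0
    risk_set_sum = 0.0
    for i in order:
        risk_set_sum += math.exp(risk[i])
        if events[i]:
            # Only observed events contribute a term; censored patients
            # appear in the risk set denominator alone.
            total += risk[i] - math.log(risk_set_sum)
    return total


# Toy usage: three patients, two topics, one censored patient.
theta = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
w = [1.0, -1.0]
print(cox_partial_log_likelihood(theta, w, times=[2.0, 5.0, 8.0], events=[1, 0, 1]))
```

In a supervised topic model of this kind, a term like this would be combined with the topic model's own likelihood, so that the inferred topics are shaped by both the EHR data and the survival outcome; how MixEHR-SurG combines the two is not specified in this abstract.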