MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling

Automatic subphenotyping from electronic health records (EHRs) provides numerous opportunities to understand diseases that comprise distinct subgroups and to advance personalized medicine. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics that overlook nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer subphenotype topics from thousands of diseases using multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics within each phenotype topic, whose prior is guided by expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classifications Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset, comprising over 38,000 intensive care unit (ICU) patients at Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; and (2) PopHR, a healthcare administrative database comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, and that these subphenotypes are predictive of disease progression and severity. For instance, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes from CCS codes, even though CCS codes themselves do not differentiate these two subtypes. Additionally, MixEHR-Nest not only improved the prediction accuracy for short-term mortality of ICU patients and for initial insulin treatment in diabetic patients, but also revealed the contributions of individual subphenotypes to these predictions. In longitudinal analysis, MixEHR-Nest identified subphenotypes with distinct age prevalence within the same phenotype, for conditions such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available on GitHub: https://github.com/li-lab-mcgill/MixEHR-Nest.
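To make the idea of a phenotype-guided prior over nested subtopics concrete, the following minimal sketch shows one way such a prior could be constructed. It is an illustration under stated assumptions, not the MixEHR-Nest implementation: the function name `build_guided_prior`, the hyperparameters `base` and `boost`, the number of subtopics, and the toy code-to-phenotype mapping are all hypothetical. Each phenotype is given a fixed number of subtopics, and a patient's Dirichlet hyperparameters are raised for every subtopic of any phenotype whose codes appear in that patient's record.

```python
def build_guided_prior(patient_codes, code_to_phenotype, n_phenotypes,
                       n_subtopics=3, base=0.01, boost=1.0):
    """Build per-patient Dirichlet hyperparameters over nested subtopics.

    Hypothetical helper for illustration only. The topic axis is laid out
    as [phenotype 0 subtopics..., phenotype 1 subtopics..., ...].
    """
    priors = []
    for codes in patient_codes:
        # Phenotypes observed for this patient, via the expert-curated mapping.
        observed = {code_to_phenotype[c] for c in codes if c in code_to_phenotype}
        row = []
        for k in range(n_phenotypes):
            # All subtopics of an observed phenotype share the boosted prior;
            # inference is then free to split patients among those subtopics.
            value = base + boost if k in observed else base
            row.extend([value] * n_subtopics)
        priors.append(row)
    return priors


# Toy mapping (assumed): two diagnosis codes map to phenotype 0, one to phenotype 1.
code_to_phenotype = {"250.01": 0, "250.00": 0, "493.90": 1}
patients = [["250.01"], ["493.90", "250.00"], ["V70.0"]]

alpha = build_guided_prior(patients, code_to_phenotype, n_phenotypes=2)
for row in alpha:
    print(row)
```

In this sketch the guidance operates only at the phenotype level, so the prior does not say which subtopic a patient belongs to. That matches the setting described above, where the guiding codes do not themselves distinguish the subtypes and any separation among subtopics has to come from the data.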
