Detailed phenotype information is fundamental to accurate diagnosis and risk estimation of diseases. As a rich source of phenotype information, electronic health records (EHRs) promise to empower diagnostic variant interpretation. However, accurately and efficiently extracting phenotypes from heterogeneous EHR data remains a challenge. In this work, we present PheME, an Ensemble framework using Multi-modality data of structured EHRs and unstructured clinical notes for accurate Phenotype prediction. First, we employ multiple deep neural networks to learn reliable representations from the sparse structured EHR data and the redundant clinical notes. A multi-modal model then aligns the features from both modalities in a shared latent space to predict phenotypes. Second, we leverage ensemble learning to combine the outputs of the single-modal and multi-modal models and improve phenotype predictions. We evaluate the phenotyping performance of the proposed framework on seven diseases. Experimental results show that using multi-modal data significantly improves phenotype prediction for all seven diseases, and that the proposed ensemble learning framework further boosts performance.
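The abstract does not specify how the ensemble combines model outputs. As one plausible reading, the following minimal sketch uses weighted soft voting over per-patient phenotype probabilities produced by a structured-EHR model, a clinical-notes model, and a multi-modal model. The function names, model names, weights, and threshold are all hypothetical and chosen for illustration only; they are not taken from PheME.

```python
from typing import Dict, List, Optional


def soft_vote(model_probs: Dict[str, List[float]],
              weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Weighted average of per-patient probabilities across models.

    Assumption: each model emits one probability per patient for a single
    phenotype. Soft voting is an illustrative choice, not the paper's
    confirmed ensemble method.
    """
    names = list(model_probs)
    if not names:
        raise ValueError("at least one model is required")
    n = len(model_probs[names[0]])
    if any(len(model_probs[m]) != n for m in names):
        raise ValueError("all models must score the same patients")
    if weights is None:
        # Default to an unweighted mean.
        weights = {m: 1.0 for m in names}
    total = sum(weights[m] for m in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(weights[m] * model_probs[m][i] for m in names) / total
        for i in range(n)
    ]


def predict_labels(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Binarize ensemble probabilities (the threshold is a hypothetical default)."""
    return [int(p >= threshold) for p in probs]


# Hypothetical outputs for two patients from three models.
example_probs = {
    "structured_ehr": [0.25, 0.75],
    "clinical_notes": [0.5, 0.5],
    "multimodal": [0.75, 1.0],
}
ensemble_probs = soft_vote(example_probs)
ensemble_labels = predict_labels(ensemble_probs)
```

Passing a `weights` dictionary lets a stronger model (for example, the multi-modal one) contribute more to the final score; in practice such weights would be tuned on a validation set.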