Knowledge Graph Embedding with Electronic Health Records Data via Latent Graphical Block Model

With the increasing adoption of electronic health records (EHR), large-scale EHR data have become another rich data source for translational clinical research. Despite their potential, deriving generalizable knowledge from EHR data remains challenging. First, EHR data are generated as part of clinical care, with data elements too detailed and fragmented for research. Despite recent progress in mapping EHR data to common ontologies with hierarchical structures, much development is still needed to enable automatic grouping of local EHR codes into meaningful clinical concepts at a large scale. Second, the total number of unique EHR features is large, which poses methodological challenges for deriving a reproducible knowledge graph, especially when interest lies in the conditional dependency structure. Third, detailed EHR data on a very large patient cohort pose an additional computational challenge for deriving a knowledge network. To overcome these challenges, we propose to infer the conditional dependency structure among EHR features via a latent graphical block model (LGBM). The LGBM has a two-layer structure: the first layer provides a semantic embedding vector (SEV) representation for the EHR features, and the second overlays a graphical block model on the latent SEVs. The block structure of the graphical model also allows us to cluster synonymous features in EHR. We propose to learn the LGBM efficiently, in both the statistical and the computational sense, based on the empirical pointwise mutual information (PMI) matrix. We establish the statistical rates of the proposed estimators and show that the block structure can be recovered perfectly. Numerical results from simulation studies and real EHR data analyses suggest that the proposed LGBM estimator performs well in finite samples.
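As a rough illustration of the input the estimator works from, the sketch below computes an empirical PMI matrix from a symmetric feature co-occurrence count matrix, using the standard definition PMI(i, j) = log(p(i, j) / (p(i) p(j))). The abstract does not say how co-occurrences are counted or how zero counts are handled, so the function name `empirical_pmi`, the count matrix, and the choice to set undefined entries to 0 are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np


def empirical_pmi(cooc):
    """Empirical PMI from a symmetric co-occurrence count matrix.

    Hypothetical helper: cooc[i, j] is assumed to hold the number of
    times features i and j co-occur; how that count is built from the
    EHR data is not specified in the abstract.
    """
    cooc = np.asarray(cooc, dtype=float)
    joint = cooc / cooc.sum()                # empirical joint probabilities p(i, j)
    marginal = joint.sum(axis=1)             # empirical marginals p(i)
    expected = np.outer(marginal, marginal)  # p(i) * p(j) under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / expected)
    # Assumption: entries with zero co-occurrence (log 0) are set to 0.
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi


# Toy example with two features.
toy_counts = np.array([[2, 1],
                       [1, 2]])
toy_pmi = empirical_pmi(toy_counts)
```

In this toy case both marginals are 0.5, so the diagonal entries equal log(4/3) and the off-diagonal entries equal log(2/3). Because only the aggregated count matrix is needed, a summary of this kind can be computed once and reused without revisiting patient-level records.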
