High-dimensional Electronic Health Record (EHR) data hold substantial potential for healthcare research, but analyzing them effectively presents notable methodological challenges. Predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can improve both statistical efficiency and interpretability. Although various methods for constructing KGs have emerged, existing techniques often lack a measure of statistical certainty about the presence of links between entities, especially when the use of patient-level EHR data is limited by privacy concerns. In this paper, we propose the first inferential framework for deriving a sparse KG with statistical guarantees, based on the dynamic log-linear topic model proposed by \cite{arora2016latent}. Within this model, the KG embeddings are estimated by performing singular value decomposition on the empirical pointwise mutual information matrix, which offers a scalable solution. We then establish entrywise asymptotic normality for the low-rank KG estimator, enabling the recovery of sparse graph edges with controlled type I error. Our work addresses the under-explored problem of statistical inference for non-linear statistics under low-rank, temporally dependent models, a critical gap in existing research. We validate our approach through extensive simulation studies and then apply it to real-world EHR data to construct clinical KGs and generate clinical feature embeddings.
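The embedding step described above (singular value decomposition of an empirical pointwise mutual information matrix) can be illustrated with a minimal sketch. This is not the paper's estimator: the co-occurrence window, the choice to set PMI to zero for code pairs that never co-occur, the square-root scaling of the singular values, and all function names are assumptions made for illustration, and the toy sequences are random rather than real EHR data.

```python
import numpy as np

def cooccurrence_matrix(sequences, vocab_size, window=2):
    """Symmetric counts of code pairs appearing within `window` positions (assumed window)."""
    C = np.zeros((vocab_size, vocab_size))
    for seq in sequences:
        for i, a in enumerate(seq):
            for b in seq[i + 1 : i + 1 + window]:
                C[a, b] += 1.0
                C[b, a] += 1.0
    return C

def pmi_matrix(C):
    """Empirical PMI: log( p(a, b) / (p(a) p(b)) ), computed from counts."""
    total = C.sum()
    marg = C.sum(axis=1)
    expected = np.outer(marg, marg) / total  # count expected under independence
    pmi = np.zeros_like(C)
    mask = C > 0  # assumption: pairs that never co-occur get PMI 0
    pmi[mask] = np.log(C[mask] / expected[mask])
    return pmi

def low_rank_embeddings(pmi, rank):
    """Truncated SVD: returns (embeddings, rank-`rank` approximation of pmi)."""
    U, s, Vt = np.linalg.svd(pmi)
    U_k, s_k = U[:, :rank], s[:rank]
    embeddings = U_k * np.sqrt(s_k)       # one row per code (assumed scaling)
    low_rank = (U_k * s_k) @ Vt[:rank]    # U_k diag(s_k) V_k^T
    return embeddings, low_rank

# Toy data: 50 random sequences over 6 hypothetical codes.
rng = np.random.default_rng(0)
sequences = [rng.integers(0, 6, size=20).tolist() for _ in range(50)]
C = cooccurrence_matrix(sequences, vocab_size=6)
pmi = pmi_matrix(C)
emb, pmi_hat = low_rank_embeddings(pmi, rank=2)
```

Only the summary co-occurrence counts enter the computation, which is consistent with the setting in which patient-level data are unavailable. The inferential step (entrywise asymptotic normality and edge testing) is not covered by this sketch.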