Electronic health records (EHRs) store hundreds of demographic and laboratory variables from large patient populations. Traditional statistical methods have limited capacity to process mixed-type data (e.g., continuous and ordinal) and to capture non-linear relationships in large multivariate data, particularly when they rely on oversimplified distributional assumptions (e.g., Gaussian) about the disparate variables in EHR data. This paper addresses these limitations by repurposing the vine copula method, which is primarily used to synthesize a multivariate distribution from many bivariate copulas (bivariate cumulative distribution functions with uniform margins). Vine copulas decompose a multivariate distribution into a sequence of trees whose edges represent bivariate conditional dependencies at successive hierarchical levels. The tree structure is used to rank variables by conditional dependence and to identify a subset of central variables with local dependence, thus simplifying probabilistic mining of high-dimensional EHR data. The proposed application of vine copulas is used to identify conditional dependence between co-morbid conditions and is validated by characterizing different cohorts of EHR patients. The contribution of this paper is a novel approach to probabilistic mining and exploration of healthcare data that provides data-driven explanations, visualization, and variable selection for predicting a healthcare outcome. The source code is shared publicly.
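The idea of ranking variables from a vine's tree structure can be illustrated with a minimal sketch. It is not the paper's implementation: it builds only the first tree, using a common heuristic (a maximum spanning tree over absolute Kendall's tau), and it assumes that "central" can be approximated by the summed dependence on a variable's tree edges. Higher trees, which require fitting pair copulas and conditioning, are omitted. The variable names, the toy values, and the helper functions `kendall_tau`, `first_vine_tree`, and `rank_by_centrality` are all hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)


def first_vine_tree(data):
    """Maximum spanning tree over |tau|, grown with Prim's algorithm.

    data maps a variable name to its list of observations.
    Returns a list of (u, v, |tau|) edges.
    """
    names = list(data)
    weight = {
        frozenset((a, b)): abs(kendall_tau(data[a], data[b]))
        for a, b in combinations(names, 2)
    }
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the strongest edge that connects the tree to a new variable.
        u, v = max(
            ((a, b) for a in in_tree for b in names if b not in in_tree),
            key=lambda e: weight[frozenset(e)],
        )
        edges.append((u, v, weight[frozenset((u, v))]))
        in_tree.add(v)
    return edges


def rank_by_centrality(edges):
    """Rank variables by the summed |tau| of their tree edges (an assumption)."""
    strength = {}
    for u, v, w in edges:
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
    return sorted(strength, key=strength.get, reverse=True)


# Hypothetical toy data: six patients, four variables (illustrative values only).
toy = {
    "age":     [3, 1, 2, 6, 4, 5],
    "bmi":     [2, 1, 3, 4, 5, 6],
    "glucose": [1, 2, 3, 4, 5, 6],
    "hba1c":   [1, 2, 3, 4, 6, 5],
}
tree_edges = first_vine_tree(toy)
ranking = rank_by_centrality(tree_edges)
print(tree_edges)
print(ranking)
```

On this toy data the tree is the path hba1c, glucose, bmi, age, and glucose ranks first because it touches the two strongest edges. A full vine would continue with trees over conditional dependencies, which is where a dedicated vine copula library would be needed.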