Graphical models are an important tool for exploring relationships between variables in complex, multivariate data. Methods for learning such graphical models are well developed in the case where all variables are either continuous or discrete, including in high dimensions. However, in many applications data span variables of different types (e.g. continuous, count, binary, ordinal), whose principled joint analysis is nontrivial. Latent Gaussian copula models, in which all variables are modeled as transformations of underlying jointly Gaussian variables, represent a useful approach. Recent advances have shown how the binary-continuous case can be tackled, but the general mixed variable-type regime remains challenging. In this work, we make the simple yet useful observation that classical ideas concerning polychoric and polyserial correlations can be leveraged in a latent Gaussian copula framework. Building on this observation, we propose flexible and scalable methodology for data with variables of entirely general mixed type. We study the key properties of the approaches theoretically and empirically, via extensive simulations as well as an illustrative application to data from the UK Biobank concerning COVID-19 risk factors.
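To make the latent-variable idea concrete, the following is a minimal sketch of the classical tetrachoric correlation, the binary-binary special case of the polychoric correlation. It is not the methodology proposed in this work, only an illustration under stated assumptions: two binary variables are taken to arise by thresholding a standard bivariate Gaussian pair, and the latent correlation is recovered by matching the observed proportion of (0, 0) pairs. The function names `bvn_cdf`, `tetrachoric`, and `simulate_binary_pair`, as well as the cut points and sample size, are hypothetical choices made for this example.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal: mean 0, standard deviation 1


def bvn_cdf(a, b, rho, n_steps=400):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal with correlation rho.

    Uses the identity
        P = integral_{-inf}^{a} phi(x) * Phi((b - rho * x) / sqrt(1 - rho^2)) dx,
    approximated with Simpson's rule on [-8, min(a, 8)]. n_steps must be even.
    """
    lo, hi = -8.0, min(a, 8.0)
    if hi <= lo:
        return 0.0
    s = (1.0 - rho * rho) ** 0.5
    h = (hi - lo) / n_steps

    def f(x):
        return STD.pdf(x) * STD.cdf((b - rho * x) / s)

    total = f(lo) + f(hi)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def tetrachoric(x, y, tol=1e-6):
    """Estimate the latent correlation behind two 0/1 sequences.

    Assumes both variables take both values at least once, so that the
    estimated thresholds are finite.
    """
    n = len(x)
    p_x0 = sum(1 for v in x if v == 0) / n
    p_y0 = sum(1 for v in y if v == 0) / n
    p_00 = sum(1 for u, v in zip(x, y) if u == 0 and v == 0) / n

    # Thresholds on the latent scale, from the marginal proportions of zeros.
    a = STD.inv_cdf(p_x0)
    b = STD.inv_cdf(p_y0)

    # bvn_cdf(a, b, rho) is increasing in rho, so bisection finds the match.
    lo, hi = -0.999, 0.999
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bvn_cdf(a, b, mid) < p_00:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_binary_pair(n, rho, cut_x=0.3, cut_y=-0.5, seed=0):
    """Draw n latent Gaussian pairs with correlation rho and threshold them."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        x.append(int(z1 > cut_x))
        y.append(int(z2 > cut_y))
    return x, y


if __name__ == "__main__":
    x, y = simulate_binary_pair(20000, rho=0.6)
    print("estimated latent correlation:", round(tetrachoric(x, y), 3))
```

The estimate targets the correlation of the latent Gaussian pair rather than the Pearson correlation of the observed 0/1 values, which is generally attenuated by the thresholding. Polychoric and polyserial correlations extend the same matching idea to ordinal-ordinal and ordinal-continuous pairs.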