With the increasing availability of data objects in the form of probability distributions, there is a growing need for statistical methods tailored to distributional data. Distance measures, especially the pairwise distance matrix between data objects, provide the foundation for a wide range of modern data analysis methods, such as clustering, multidimensional scaling, and distance-based regression. The Wasserstein distance is commonly used with distributional data because of its compelling optimal transport property. However, while the Wasserstein distance can be computed efficiently for univariate distributions, its application to multivariate distributions is limited by high computational costs. To address these scalability issues, we introduce the Nonparanormal Transport (NPT) metric, a closed-form distance based on the flexible nonparanormal distribution family for modeling skewed and non-Gaussian multivariate data. Simulation studies demonstrate that NPT maintains a high level of agreement with the Wasserstein distance, while being at least 1000 times faster than efficient variants of the Wasserstein distance when computing the pairwise distance matrix of 100 distributions in both 2 and 5 dimensions. We illustrate the utility of NPT through a multidimensional scaling analysis of bivariate oxygen desaturation distributions of 723 individuals with sleep apnea in the Sleep Heart Health Study.
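To make the univariate case concrete, the following is a minimal sketch of a pairwise 2-Wasserstein distance matrix for one-dimensional samples. It relies on the standard fact that in one dimension the 2-Wasserstein distance equals the L2 distance between quantile functions. It is not the NPT metric, whose closed form is not stated in this abstract. The function names `quantile_grid` and `pairwise_w2_1d`, the grid size `n_grid`, and the simulated samples are all assumptions made for illustration.

```python
import numpy as np

def quantile_grid(sample, n_grid=200):
    """Empirical quantiles of a 1-D sample on a midpoint grid in (0, 1)."""
    u = (np.arange(n_grid) + 0.5) / n_grid  # hypothetical grid choice
    return np.quantile(np.asarray(sample, dtype=float), u)

def pairwise_w2_1d(samples, n_grid=200):
    """Approximate pairwise 2-Wasserstein distances between 1-D samples.

    In one dimension, W_2(F, G)^2 is the integral over u in (0, 1) of
    (F^{-1}(u) - G^{-1}(u))^2, approximated here by a mean over the grid.
    """
    Q = np.stack([quantile_grid(s, n_grid) for s in samples])  # (n, n_grid)
    diff = Q[:, None, :] - Q[None, :, :]                       # (n, n, n_grid)
    return np.sqrt(np.mean(diff ** 2, axis=-1))                # (n, n)

# Illustrative data only: a skewed sample, a shifted copy, and a Gaussian sample.
rng = np.random.default_rng(0)
base = rng.gamma(shape=2.0, scale=1.0, size=500)
samples = [base, base + 3.0, rng.normal(size=500)]

D = pairwise_w2_1d(samples)
print(np.round(D, 3))
```

Because each distribution is reduced once to a vector of quantiles, the full matrix costs one sort per sample plus a vectorized difference, which is why the univariate case scales well. A matrix such as `D` is the kind of input that clustering or multidimensional scaling would consume.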