Principal variables analysis (PVA) is a technique for selecting a subset of variables that capture as much of the information in a dataset as possible. Existing approaches for PVA are based on the Pearson correlation matrix, which is not well-suited to describing the relationships between non-Gaussian variables. We propose a generalized approach to PVA enabling the use of different types of correlation, and we explore using Spearman, Gaussian copula, and polychoric correlations as alternatives to Pearson correlation when performing PVA. We compare performance in simulation studies varying the form of the true multivariate distribution over a wide range of possibilities. Our results show that on continuous non-Gaussian data, using generalized PVA with Gaussian copula or Spearman correlations provides a major improvement in performance compared to Pearson. Meanwhile, on ordinal data, generalized PVA with polychoric correlations outperforms the rest by a wide margin. We apply generalized PVA to a dataset of 102 clinical variables measured on individuals with X-linked dystonia parkinsonism (XDP), a rare neurodegenerative disorder, and we find that using different types of correlation yields substantively different sets of principal variables.
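To make the idea of swapping correlation types concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the selection rule, so the sketch assumes a simple greedy criterion: at each step, pick the variable that explains the most remaining variance in the (partial) correlation matrix. The function names `correlation_matrix` and `greedy_principal_variables` are hypothetical. Only Pearson and Spearman are shown; a Gaussian copula or polychoric correlation matrix would have to be estimated separately and passed in as `R`.

```python
import numpy as np
from scipy.stats import rankdata


def correlation_matrix(X, kind="pearson"):
    """Return a p x p correlation matrix for an n x p data array X.

    Illustrative helper (hypothetical name). Spearman correlation is
    computed as the Pearson correlation of column-wise ranks.
    """
    X = np.asarray(X, dtype=float)
    if kind == "pearson":
        return np.corrcoef(X, rowvar=False)
    if kind == "spearman":
        ranks = rankdata(X, axis=0)  # average ranks for ties
        return np.corrcoef(ranks, rowvar=False)
    raise ValueError(f"unsupported correlation type: {kind}")


def greedy_principal_variables(R, k, tol=1e-10):
    """Greedily select k variable indices from a correlation matrix R.

    Assumed criterion (not taken from the source): at each step choose the
    variable j maximizing sum_i S[i, j]**2 / S[j, j], where S is the
    partial covariance matrix given the variables selected so far. This
    quantity is the reduction in trace(S) obtained by conditioning on j.
    """
    S = np.array(R, dtype=float)  # copy, so the caller's matrix is untouched
    remaining = list(range(S.shape[0]))
    selected = []
    for _ in range(k):
        sub = S[np.ix_(remaining, remaining)]
        diag = np.diag(sub)
        # Skip variables already (numerically) fully explained.
        ok = diag > tol
        scores = np.full(len(remaining), -np.inf)
        scores[ok] = (sub[:, ok] ** 2).sum(axis=0) / diag[ok]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        # Schur complement: condition every variable on the chosen one.
        S = S - np.outer(S[:, best], S[best, :]) / S[best, best]
        remaining.remove(best)
    return selected


# Usage: variables 0 and 1 are largely redundant, variable 2 is distinct.
R_example = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])
chosen = greedy_principal_variables(R_example, k=2)
```

Because the selection step only sees the matrix `R`, changing the type of correlation changes nothing else in the procedure, which is the sense in which the approach generalizes.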