Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored. In this paper, we consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
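For readers unfamiliar with the quantity shown in a scree plot, the following is a minimal sketch of the usual sample PVE, computed from the singular values of a data matrix. It assumes column-centering and the standard definition (squared singular value divided by the sum of all squared singular values). It does not implement the conditional population PVE or the inferential procedure introduced in the paper, and the function name `sample_pve` and the simulated data are illustrative assumptions only.

```python
import numpy as np


def sample_pve(X, center=True):
    """Sample proportion of variance explained by each principal component.

    Hypothetical helper for illustration; assumes the standard definition
    PVE_k = d_k^2 / sum_j d_j^2, where d_k are the singular values of X.
    """
    X = np.asarray(X, dtype=float)
    if center:
        # Column-centering is an assumption here, as in routine PCA.
        X = X - X.mean(axis=0, keepdims=True)
    # Singular values are returned in non-increasing order.
    d = np.linalg.svd(X, compute_uv=False)
    return d**2 / np.sum(d**2)


# Simulated example: a rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 10))
X = signal + 0.1 * rng.normal(size=(100, 10))

pve = sample_pve(X)          # one value per component, as in a scree plot
cumulative = np.cumsum(pve)  # PVE of the leading subsets of components
```

Plotting `pve` against the component index gives the scree plot. These values are sample summaries only and carry no confidence intervals or p-values.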