Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored. In this paper, we consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
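As a point of reference for the quantity shown on a scree plot, the sample PVE can be computed from the singular values of the data matrix. The following is a minimal sketch, not the inferential procedure proposed in the paper: it computes only the usual sample PVE, and does not compute the population PVE defined conditional on the sample singular vectors, nor any confidence intervals or p-values. The function name `sample_pve`, the column-centering step, and the simulated low-rank-plus-noise data are assumptions made for illustration.

```python
import numpy as np


def sample_pve(X, center=True):
    """Return the sample proportion of variance explained by each principal component.

    Illustrative helper (hypothetical name). Column centering is an assumption;
    pass center=False to work with the raw matrix instead.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0, keepdims=True)
    # Singular values are returned in non-increasing order.
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return var / var.sum()


# Simulated example: a rank-2 signal plus Gaussian noise (illustrative data only).
rng = np.random.default_rng(0)
n, p, rank = 50, 10, 2
signal = 3.0 * rng.normal(size=(n, rank)) @ rng.normal(size=(rank, p))
X = signal + rng.normal(size=(n, p))

pve = sample_pve(X)
cumulative_pve = np.cumsum(pve)
```

Plotting `pve` against the component index gives the scree plot. When the low-rank assumption holds, the first few entries should account for most of the total.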