Inference on the proportion of variance explained in principal component analysis

Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored. In this paper, we consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.



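In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional space by maximizing the variance along each dimension. Given a set of points in two, three, or more dimensions, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line is chosen in the same way from among the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called the principal components.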
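As a minimal sketch of the construction described above, the following Python snippet computes principal components through the singular value decomposition of the column-centered data matrix and reports the sample PVE of each component, which is the quantity a scree plot displays. The function name `pca_pve`, the simulated rank-2 data, and the noise level are illustrative assumptions. This is the standard sample PVE, not the conditional population quantity or the inference procedure proposed in the paper.

```python
import numpy as np


def pca_pve(X):
    """Return (components, scores, pve) for a data matrix X (rows = observations).

    Hypothetical helper for illustration only.
    """
    # Center each column so the SVD describes variation around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; d holds the singular values.
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of each observation along the principal directions.
    scores = U * d
    # Sample proportion of variance explained by each component.
    pve = d**2 / np.sum(d**2)
    return Vt, scores, pve


# Simulated data (assumption): a rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
n, p = 200, 5
signal = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p))
X = signal + 0.1 * rng.normal(size=(n, p))

components, scores, pve = pca_pve(X)
print(np.round(pve, 3))             # values one would plot in a scree plot
print(np.round(np.cumsum(pve), 3))  # cumulative PVE
```

Because the underlying signal here has rank 2, nearly all of the variance should be attributed to the first two components, and the scores along different components are uncorrelated by construction.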
