Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots, parallel analysis, etc.) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of "empirical null" matrices generated by flipping the sign of each entry randomly with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, we illustrate that FlipPA compares favorably to state-of-the-art methods via numerical simulations and an application to astronomical data.
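The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of the signflip idea only, not the paper's reference implementation: the parameter names (`n_null`, `alpha`) and the quantile-based comparison rule are illustrative assumptions, and the paper's actual decision rule and theoretical calibration may differ.

```python
import numpy as np

def flippa_rank(Y, n_null=20, alpha=0.05, seed=0):
    """Illustrative sketch of signflip parallel analysis (FlipPA).

    Compares each singular value of the data matrix Y to the corresponding
    singular values of "empirical null" matrices obtained by flipping the
    sign of each entry of Y independently with probability 1/2.
    """
    rng = np.random.default_rng(seed)
    data_svals = np.linalg.svd(Y, compute_uv=False)

    # Build empirical null singular values from signflipped copies of Y.
    null_svals = np.empty((n_null, data_svals.size))
    for t in range(n_null):
        signs = rng.choice([-1.0, 1.0], size=Y.shape)  # flip w.p. 1/2
        null_svals[t] = np.linalg.svd(signs * Y, compute_uv=False)

    # Keep components whose singular value exceeds the (1 - alpha) null
    # quantile; stop at the first failure so the selected rank is contiguous.
    thresholds = np.quantile(null_svals, 1 - alpha, axis=0)
    rank = 0
    for s, t in zip(data_svals, thresholds):
        if s <= t:
            break
        rank += 1
    return rank
```

As a quick sanity check, a matrix with two strong planted components plus i.i.d. Gaussian noise should yield a selected rank of 2, since only the two signal singular values rise above the signflip null.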