Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots, parallel analysis, etc.) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of "empirical null" matrices generated by flipping the sign of each entry randomly with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, we illustrate that FlipPA compares favorably to state-of-the-art methods via numerical simulations and an application to astronomical data.
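The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of the signflip idea only, not the paper's reference implementation: the parameter names (`n_null`, `alpha`) and the quantile-based comparison rule are illustrative assumptions, and the paper's actual decision rule and theoretical calibration may differ.

```python
import numpy as np

def flippa_rank(Y, n_null=20, alpha=0.05, seed=0):
    """Illustrative sketch of signflip parallel analysis (FlipPA).

    Compares each singular value of the data matrix Y to the corresponding
    singular values of "empirical null" matrices obtained by flipping the
    sign of each entry of Y independently with probability 1/2.
    """
    rng = np.random.default_rng(seed)
    data_svals = np.linalg.svd(Y, compute_uv=False)

    # Build empirical null singular values from signflipped copies of Y.
    null_svals = np.empty((n_null, data_svals.size))
    for t in range(n_null):
        signs = rng.choice([-1.0, 1.0], size=Y.shape)  # flip w.p. 1/2
        null_svals[t] = np.linalg.svd(signs * Y, compute_uv=False)

    # Keep components whose singular value exceeds the (1 - alpha) null
    # quantile; stop at the first failure so the selected rank is contiguous.
    thresholds = np.quantile(null_svals, 1 - alpha, axis=0)
    rank = 0
    for s, t in zip(data_svals, thresholds):
        if s <= t:
            break
        rank += 1
    return rank
```

As a quick sanity check, a matrix with two strong planted components plus i.i.d. Gaussian noise should yield a selected rank of 2, since only the two signal singular values rise above the signflip null.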