Dimensionality reduction via PCA and factor analysis is an important tool of data analysis. A critical step is selecting the number of components. However, existing methods (such as the scree plot, the likelihood ratio test, and parallel analysis) do not have statistical guarantees in the increasingly common setting where the data are heterogeneous, that is, where each noise entry can have a different distribution. To address this problem, we propose the Signflip Parallel Analysis (Signflip PA) method: it compares the singular values of the data to those of "empirical null" data generated by flipping the sign of each entry independently with probability one-half. We show that Signflip PA consistently selects factors above the noise level in high-dimensional signal-plus-noise models (including spiked models and factor models) under heterogeneous settings, where classical parallel analysis is no longer effective. To do this, we rely on recent results in random matrix theory, such as dimension-free operator norm bounds [Latala et al., 2018, Inventiones Mathematicae] and large deviations for the top eigenvalues of nonhomogeneous matrices [Husson, 2020]. We also illustrate that Signflip PA performs well in numerical simulations and on empirical data examples.
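The sign-flipping comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function name `signflip_pa`, the number of null draws `n_trials`, the use of an upper `quantile` of the null singular values as the threshold, and the rule of stopping at the first component that fails the comparison are all choices made here for illustration and are not specified in the abstract.

```python
import numpy as np


def signflip_pa(X, n_trials=200, quantile=0.95, seed=0):
    """Sketch of a sign-flip parallel analysis (hypothetical helper).

    Assumptions not taken from the abstract: the threshold is an upper
    quantile over `n_trials` null draws, and selection stops at the first
    singular value that does not exceed its threshold.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Singular values of the observed data, in descending order.
    data_sv = np.linalg.svd(X, compute_uv=False)

    # Singular values of "empirical null" data: each entry of X has its
    # sign flipped independently with probability one-half.
    null_sv = np.empty((n_trials, data_sv.size))
    for t in range(n_trials):
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_sv[t] = np.linalg.svd(signs * X, compute_uv=False)

    # Per-rank threshold: upper quantile of the k-th null singular value.
    thresholds = np.quantile(null_sv, quantile, axis=0)

    # Keep leading components while the data singular value exceeds the
    # corresponding null threshold.
    k = 0
    while k < data_sv.size and data_sv[k] > thresholds[k]:
        k += 1
    return k
```

Calling `signflip_pa(X)` on an `n x p` data matrix returns the number of selected components. Because the sign flips act entrywise, the null draws retain each entry's magnitude, which is what lets the comparison tolerate noise whose scale varies from entry to entry.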