Principal Component Analysis (PCA) is a critical tool for dimensionality reduction and data analysis. This paper revisits PCA through the lens of generalized spiked covariance and correlation models, which allow for more realistic and complex data structures. We explore the asymptotic properties of the sample principal components (PCs) derived from both the sample covariance and correlation matrices, focusing on how data normalization, an essential step for scale-invariant analysis, affects these properties. Our results reveal that while normalization does not alter the first-order limits of spiked eigenvalues and eigenvectors, it significantly influences their second-order behavior. We establish new theoretical findings, including a joint central limit theorem for bilinear forms of the sample covariance matrix's resolvent and diagonal entries, providing a robust framework for understanding spiked models in high dimensions. Our theoretical results also reveal an intriguing phenomenon regarding the effect of data normalization when the variances of covariates are equal. Specifically, they suggest that high-dimensional PCA based on the correlation matrix may not only perform comparably to, but potentially even outperform, PCA based on the covariance matrix, particularly when the leading principal component is sufficiently large. This study not only extends the existing literature on spiked models but also offers practical guidance for applying PCA in real-world scenarios, particularly when dealing with normalized data.
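The contrast between covariance-based and correlation-based PCA can be illustrated numerically. The sketch below (all parameters, such as the dimension, sample size, and spike strength, are illustrative assumptions, not values from the paper) draws data from a one-spike model with equal covariate variances and compares the leading sample eigenvalue of the covariance matrix with that of the correlation matrix; consistent with the first-order results described above, the two agree to leading order.

```python
# Minimal sketch (hypothetical parameters): one-spike covariance model
# with equal variances; compare spiked eigenvalues of the sample
# covariance matrix and the sample correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
p, n, spike = 200, 1000, 20.0  # dimension, sample size, spike strength (assumed)

# Population covariance: identity plus one spike along a unit vector u,
# so the leading population eigenvalue is `spike` and all variances are equal.
u = np.ones(p) / np.sqrt(p)
Sigma = np.eye(p) + (spike - 1.0) * np.outer(u, u)

# Draw n observations from N(0, Sigma).
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S = np.cov(X, rowvar=False)        # sample covariance matrix
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix (normalized data)

lam_cov = np.linalg.eigvalsh(S)[-1]    # leading eigenvalue, covariance PCA
lam_corr = np.linalg.eigvalsh(R)[-1]   # leading eigenvalue, correlation PCA

print(f"leading eigenvalue (covariance):  {lam_cov:.2f}")
print(f"leading eigenvalue (correlation): {lam_corr:.2f}")
```

Since the spike (20) is far above the detection threshold, both leading sample eigenvalues separate clearly from the bulk near their common first-order limit; the second-order fluctuations, which the paper shows differ between the two, would only be visible across repeated simulations.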