Principal Component Analysis (PCA) is a critical tool for dimensionality reduction and data analysis. This paper revisits PCA through the lens of generalized spiked covariance and correlation models, which allow for more realistic and complex data structures. We explore the asymptotic properties of the sample principal components (PCs) derived from both the sample covariance and correlation matrices, focusing on how data normalization, an essential step for scale-invariant analysis, affects these properties. Our results reveal that while normalization does not alter the first-order limits of spiked eigenvalues and eigenvectors, it significantly influences their second-order behavior. We establish new theoretical findings, including a joint central limit theorem for bilinear forms of the sample covariance matrix's resolvent and diagonal entries, providing a robust framework for understanding spiked models in high dimensions. Our theoretical results also reveal an intriguing phenomenon regarding the effect of data normalization when the variances of covariates are equal. Specifically, they suggest that high-dimensional PCA based on the correlation matrix may not only perform comparably to, but potentially even outperform, PCA based on the covariance matrix, particularly when the leading principal component is sufficiently large. This study not only extends the existing literature on spiked models but also offers practical guidance for applying PCA in real-world scenarios, particularly when dealing with normalized data.
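The contrast between covariance-based and correlation-based PCA can be illustrated numerically. The sketch below (all parameters, such as the dimension, sample size, and spike strength, are illustrative assumptions, not values from the paper) draws data from a one-spike model with equal covariate variances and compares the leading sample eigenvalue of the covariance matrix with that of the correlation matrix; consistent with the first-order results described above, the two agree to leading order.

```python
# Minimal sketch (hypothetical parameters): one-spike covariance model
# with equal variances; compare spiked eigenvalues of the sample
# covariance matrix and the sample correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
p, n, spike = 200, 1000, 20.0  # dimension, sample size, spike strength (assumed)

# Population covariance: identity plus one spike along a unit vector u,
# so the leading population eigenvalue is `spike` and all variances are equal.
u = np.ones(p) / np.sqrt(p)
Sigma = np.eye(p) + (spike - 1.0) * np.outer(u, u)

# Draw n observations from N(0, Sigma).
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S = np.cov(X, rowvar=False)        # sample covariance matrix
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix (normalized data)

lam_cov = np.linalg.eigvalsh(S)[-1]    # leading eigenvalue, covariance PCA
lam_corr = np.linalg.eigvalsh(R)[-1]   # leading eigenvalue, correlation PCA

print(f"leading eigenvalue (covariance):  {lam_cov:.2f}")
print(f"leading eigenvalue (correlation): {lam_corr:.2f}")
```

Since the spike (20) is far above the detection threshold, both leading sample eigenvalues separate clearly from the bulk near their common first-order limit; the second-order fluctuations, which the paper shows differ between the two, would only be visible across repeated simulations.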