PCA, SVD, and Centering of Data

This paper examines Principal Component Analysis (PCA), a foundational method in statistics and machine learning for reducing the dimensionality of data. PCA is commonly computed with the Singular Value Decomposition (SVD), and the procedure requires centering: subtracting the mean from the data set. We study the influence of this essential but often ignored or downplayed step. Specifically, we investigate the conditions under which two PCA embeddings, one obtained from the SVD with centering and the other without, can be regarded as aligned. To that end, we analyze the relationship between the first singular vector and the mean direction, and link this relationship to the agreement between the SVDs of the centered and uncentered matrices. We also examine, from a spectral analysis standpoint, the consequences of omitting centering when PCA is performed via SVD. The results underline the need to understand the subtleties involved in computing PCA, and we believe the paper contributes to a more careful understanding of this foundational statistical method.
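The comparison described in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's own experiment: the synthetic data, the size of the mean offset, and the helper name `top_right_singular_vectors` are all assumptions made for illustration. It computes the leading right singular vectors of a data matrix with and without centering and measures how closely the uncentered first singular vector follows the mean direction.

```python
import numpy as np


def top_right_singular_vectors(X, k, center):
    """Return the top-k right singular vectors of X (rows of Vt).

    Hypothetical helper for this sketch. If `center` is True, the
    column means are subtracted before the SVD is taken.
    """
    if center:
        X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]


rng = np.random.default_rng(0)

# Assumed synthetic data: an anisotropic Gaussian cloud whose mean
# lies far from the origin relative to its spread.
n, d = 500, 5
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.25])
offset = np.array([20.0, -15.0, 10.0, 5.0, 0.0])
X = rng.normal(size=(n, d)) * scales + offset

mean = X.mean(axis=0)
mean_dir = mean / np.linalg.norm(mean)

V_unc = top_right_singular_vectors(X, 2, center=False)
V_cen = top_right_singular_vectors(X, 2, center=True)

# Absolute cosines, because singular vectors are defined only up to sign.
cos_mean = abs(V_unc[0] @ mean_dir)   # uncentered first vector vs. mean direction
cos_first = abs(V_unc[0] @ V_cen[0])  # uncentered vs. centered first vector

print(f"|cos(uncentered v1, mean direction)| = {cos_mean:.4f}")
print(f"|cos(uncentered v1, centered v1)|    = {cos_first:.4f}")
```

In this setup the mean dominates the uncentered matrix, so its first singular vector mostly tracks the mean direction and need not agree with the first principal direction of the centered data. With a small offset the two embeddings would be expected to agree more closely.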

PCA

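In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional one by maximizing the variance along each retained dimension. Given a set of points in two, three, or more dimensions, a "best-fit" line can be defined as the line that minimizes the mean squared distance from the points to the line. The next best-fit line is chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components.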
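The properties listed in this definition can be checked with a short sketch. It assumes that PCA is computed from the SVD of the centered data matrix; the synthetic data and all variable names are illustrative rather than taken from the source.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed synthetic data: correlated 3-D points with a nonzero mean.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.normal(size=(1000, 3)) @ A.T + np.array([5.0, -2.0, 1.0])

# Center the data, then take the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt          # rows are the principal directions
scores = Xc @ Vt.T       # coordinates of each point in the new basis

# Variance captured by each direction, in decreasing order.
explained_variance = s**2 / (len(X) - 1)

# Covariance of the scores: off-diagonal entries should vanish.
score_cov = np.cov(scores, rowvar=False)

print("orthonormal basis:", np.allclose(components @ components.T, np.eye(3)))
print("explained variance:", explained_variance)
print("score covariance:\n", np.round(score_cov, 6))
```

The rows of `components` form an orthonormal basis, the covariance matrix of `scores` is diagonal up to rounding error, so the new coordinates are uncorrelated, and the diagonal entries match `explained_variance`.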
