On the Error-Propagation of Inexact Deflation for Principal Component Analysis

Principal Component Analysis (PCA) is a popular tool in data analysis, especially when the data is high-dimensional. PCA aims to find subspaces, spanned by the so-called \textit{principal components}, that best explain the variance in the dataset. The deflation method is a popular meta-algorithm for discovering such subspaces: it finds individual principal components sequentially, starting from the most important one and working towards the less important ones. However, because the method is sequential, the numerical error introduced by estimating principal components inexactly (e.g., through numerical approximations in the sub-routine) propagates as deflation proceeds. To the best of our knowledge, this is the first work that mathematically characterizes the error propagation of the inexact deflation method, and this is the key contribution of this paper. We provide two main results: $i)$ one for the case where the sub-routine for finding the leading eigenvector is generic, and $ii)$ one for the case where power iteration is used as the sub-routine. In the latter case, the additional directional information from power iteration allows us to obtain a tighter error bound than in the sub-routine-agnostic analysis. As an outcome, we provide an explicit characterization of how the error progresses and affects subsequent principal component estimates for this fundamental problem.
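To make the setting concrete, here is a minimal sketch of deflation with power iteration as the sub-routine. It is not taken from the paper: it assumes a Hotelling-style deflation step (subtracting the estimated component's contribution from the matrix), and the names `power_iteration`, `deflation_pca`, and `sign_invariant_error` are hypothetical. Limiting `num_iters` makes each estimate inexact, and that error stays in the deflated matrix for every later component.

```python
import numpy as np


def power_iteration(matrix, num_iters, rng):
    """Approximate the leading eigenvector of a symmetric PSD matrix."""
    v = rng.standard_normal(matrix.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = matrix @ v
        v = w / np.linalg.norm(w)
    return v


def deflation_pca(cov, k, num_iters=50, seed=0):
    """Estimate the top-k principal components one at a time.

    Assumption: Hotelling-style deflation, i.e. after estimating v the
    matrix is replaced by  M - (v^T M v) v v^T.
    """
    rng = np.random.default_rng(seed)
    residual = np.array(cov, dtype=float)
    components = []
    for _ in range(k):
        v = power_iteration(residual, num_iters, rng)
        # Any error in v is baked into the residual here, so it carries
        # over to all subsequent components.
        residual = residual - (v @ residual @ v) * np.outer(v, v)
        components.append(v)
    return np.stack(components)


def sign_invariant_error(estimate, truth):
    """Distance between unit vectors, ignoring the sign ambiguity."""
    return min(np.linalg.norm(estimate - truth),
               np.linalg.norm(estimate + truth))


if __name__ == "__main__":
    # Synthetic covariance with known eigenvectors (columns of q).
    rng = np.random.default_rng(1)
    q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    cov = q @ np.diag([5.0, 3.0, 1.5, 0.5, 0.1]) @ q.T

    for iters in (5, 500):
        comps = deflation_pca(cov, k=3, num_iters=iters)
        errors = [sign_invariant_error(comps[i], q[:, i]) for i in range(3)]
        print(iters, errors)
```

Running the script prints the per-component errors for a small and a large iteration budget, which gives a quick empirical view of how inexactness in early components affects later ones on this synthetic example.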

PCA

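In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional space by maximizing the variance along each dimension. Given a collection of points in two-, three-, or higher-dimensional space, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line can be chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called the principal components.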
