Principal Component Analysis (PCA) is a popular tool in data analysis, especially when the data is high-dimensional. PCA aims to find subspaces, spanned by the so-called \textit{principal components}, that best explain the variance in the dataset. The deflation method is a popular meta-algorithm for discovering such subspaces: it finds individual principal components sequentially, starting from the most important one and working its way towards the less important ones. However, because the method is sequential, the numerical error introduced by not estimating principal components exactly (e.g., due to numerical approximations in this process) propagates as deflation proceeds. To the best of our knowledge, this is the first work that mathematically characterizes the error propagation of the inexact deflation method, and this is the key contribution of this paper. We provide two main results: $i)$ one for the case where the sub-routine for finding the leading eigenvector is generic, and $ii)$ one for the case where power iteration is used as the sub-routine. In the latter case, the additional directional information from power iteration allows us to obtain a tighter error bound than in the sub-routine-agnostic analysis. As an outcome, we provide an explicit characterization of how the error progresses and affects subsequent principal component estimates for this fundamental problem.
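To make the setting concrete, the following is a minimal sketch of deflation with power iteration as the sub-routine, written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`power_iteration`, `deflate`, `deflation_pca`), the fixed start vector, the Hotelling-style update $A \leftarrow A - (v^\top A v)\, v v^\top$, and the toy covariance matrix $\mathrm{diag}(4, 2, 1)$ are all choices made here for illustration. Running the sub-routine for only a few iterations yields an inexact first component, and that inexactness carries over into the deflated matrix and hence into the second component.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [dot(row, v) for row in A]


def power_iteration(A, num_iters):
    """Approximate the leading eigenvector of a symmetric PSD matrix A.

    Fewer iterations give a less exact estimate; this is the source of
    inexactness in the sketch.
    """
    n = len(A)
    # Fixed, deterministic start vector (an assumption of this sketch).
    v = [1.0 / (i + 1) for i in range(n)]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(num_iters):
        w = matvec(A, v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # A maps v to zero; keep the current estimate
        v = [x / norm for x in w]
    return v


def deflate(A, v):
    """Hotelling-style deflation (assumed here): A - (v^T A v) v v^T."""
    lam = dot(v, matvec(A, v))
    n = len(A)
    return [[A[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]


def deflation_pca(A, k, num_iters):
    """Estimate k principal components one at a time via deflation."""
    components = []
    for _ in range(k):
        v = power_iteration(A, num_iters)
        components.append(v)
        A = deflate(A, v)  # any error in v is baked into the next matrix
    return components


# Toy covariance matrix whose true principal components are the
# standard basis vectors e1, e2, e3 (illustrative data only).
A = [[4.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]

accurate = deflation_pca(A, 2, 200)  # sub-routine run nearly to convergence
rough = deflation_pca(A, 2, 3)       # inexact sub-routine

# |<v_2, e2>| measures how well the second estimate matches the truth.
print("accurate alignment with e2:", abs(accurate[1][1]))
print("rough alignment with e2:   ", abs(rough[1][1]))
```

In this toy run, the second component from the 3-iteration sub-routine is less well aligned with the true direction than the one from the 200-iteration run, partly because the deflated matrix was built from an inexact first component.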