Efficient representations of data are essential for processing, exploration, and human understanding, and Principal Component Analysis (PCA) is one of the most common dimensionality reduction techniques for analyzing large, multivariate datasets. The method has two well-known limitations: it is sensitive to outliers and noise, and it offers no clear methodology for quantifying the uncertainty of the principal components or their associated explained variances. Whereas previous work has addressed each of these problems individually, we propose a scalable method called Ensemble PCA (EPCA) that addresses them simultaneously for data with an inherently low-rank structure. EPCA combines bootstrapped PCA with k-means cluster analysis to handle the sign ambiguity and re-ordering of components across the bootstrap subsamples. The result is a noise-resistant extension of PCA that lends itself naturally to uncertainty quantification. We test EPCA on data corrupted with white noise, sparse noise, and outliers against both classical PCA and Robust PCA (RPCA), and show that EPCA performs competitively across the noise scenarios, with a clear advantage on datasets containing outliers and an orders-of-magnitude reduction in computational cost compared to RPCA.
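
The combination of bootstrapped PCA and k-means clustering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`top_components`, `kmeans`, `ensemble_pca_sketch`), the parameter `n_boot`, the sign-alignment rule, and the choice to initialize the cluster centers from the full-data PCA are all assumptions made for illustration. Each bootstrap resample yields `k` components; these are pooled, sign-aligned, and clustered, so that each cluster center serves as an ensemble component and the spread of explained variances within a cluster gives a rough uncertainty estimate.

```python
import numpy as np

def top_components(X, k):
    """Top-k principal directions (as rows) and explained variances via SVD."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k], s[:k] ** 2 / (X.shape[0] - 1)

def kmeans(points, init, n_iter=50):
    """Plain Lloyd iterations starting from the given centers."""
    centers = init.copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

def ensemble_pca_sketch(X, k, n_boot=200, seed=0):
    """Hypothetical EPCA-style procedure: bootstrap PCA, then cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: full-data PCA provides reference directions, used both
    # for sign alignment and as the initial k-means centers.
    ref, _ = top_components(X, k)
    vecs, variances = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        comps, ev = top_components(X[idx], k)
        for c, v in zip(comps, ev):
            # Resolve sign ambiguity: flip c to agree with the reference
            # direction it overlaps most (an assumed convention).
            r = ref[np.argmax(np.abs(ref @ c))]
            vecs.append(c if r @ c >= 0 else -c)
            variances.append(v)
    vecs, variances = np.array(vecs), np.array(variances)
    # Clustering resolves re-ordering: components are grouped by direction,
    # not by their rank within each individual bootstrap sample.
    centers, labels = kmeans(vecs, ref)
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    mean_var = np.array([variances[labels == j].mean() for j in range(k)])
    std_var = np.array([variances[labels == j].std() for j in range(k)])
    return centers, mean_var, std_var
```

Calling `ensemble_pca_sketch(X, k=2)` on an `(n_samples, n_features)` array returns the unit-norm cluster centers together with the mean and standard deviation of the explained variance in each cluster. The sketch assumes every cluster receives at least one bootstrap component; a more careful version would guard against empty clusters.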