Ensemble Principal Component Analysis

Efficient representations of data are essential for processing, exploration, and human understanding, and Principal Component Analysis (PCA) is one of the most common dimensionality reduction techniques for analyzing large, multivariate datasets. The method has two well-known limitations: it is sensitive to outliers and noise, and it offers no clear methodology for quantifying the uncertainty of the principal components or their associated explained variances. Whereas previous work has addressed each of these problems individually, we propose a scalable method called Ensemble PCA (EPCA) that addresses them simultaneously for data with an inherently low-rank structure. EPCA combines bootstrapped PCA with k-means cluster analysis to handle the sign ambiguity and re-ordering of components that arise across the PCA subsamples. EPCA provides a noise-resistant extension of PCA that lends itself naturally to uncertainty quantification. We test EPCA on data corrupted with white noise, sparse noise, and outliers against both classical PCA and Robust PCA (RPCA), and show that EPCA performs competitively across the different noise scenarios, with a clear advantage on datasets containing outliers and an orders-of-magnitude reduction in computational cost compared to RPCA.
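The combination of bootstrapped PCA and k-means described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pca_components`, `kmeans`, `ensemble_pca_sketch`), the use of the full-data PCA as a reference for sign alignment and as the k-means initialization, and the per-cluster standard deviation as an uncertainty measure are all assumptions made for illustration.

```python
import numpy as np


def pca_components(X, k):
    """Top-k principal directions (as rows) and explained variances via SVD."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k], s[:k] ** 2 / (X.shape[0] - 1)


def kmeans(points, centers, n_iter=25):
    """Plain Lloyd iterations starting from the given centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels


def ensemble_pca_sketch(X, k, n_boot=100, seed=0):
    """Hypothetical EPCA-style procedure: bootstrap PCA, then cluster the vectors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumption: the full-data PCA serves as sign reference and k-means init.
    ref, _ = pca_components(X, k)

    vecs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        comps, _ = pca_components(X[idx], k)
        vecs.append(comps)
    vecs = np.vstack(vecs)  # shape (n_boot * k, d)

    # Sign ambiguity: flip each vector so that it points the same way as the
    # reference component it is most aligned with (assumed convention).
    sims = vecs @ ref.T
    best = np.abs(sims).argmax(axis=1)
    signs = np.sign(sims[np.arange(len(vecs)), best])
    signs[signs == 0] = 1.0
    vecs = vecs * signs[:, None]

    # Re-ordering: clustering groups vectors by direction, regardless of the
    # rank each one had inside its own bootstrap sample.
    centers, labels = kmeans(vecs, ref)
    components = centers / np.linalg.norm(centers, axis=1, keepdims=True)

    # Per-coordinate spread within each cluster as a simple uncertainty measure.
    spread = np.array([
        vecs[labels == j].std(axis=0) if np.any(labels == j) else np.full(d, np.nan)
        for j in range(k)
    ])
    return components, spread
```

Calling `ensemble_pca_sketch(X, k=2)` on an `(n_samples, n_features)` array returns one unit-norm component per cluster together with the spread of the bootstrap vectors around it. Averaging within clusters is what makes the estimate less sensitive to individual outlying rows in this sketch, since each outlier appears in only some of the resamples.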

PCA

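In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional one by maximizing the variance along each retained dimension. Given a set of points in two, three, or more dimensions, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line is chosen in the same way from among the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called the principal components.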
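The properties stated above (an orthogonal basis, uncorrelated coordinates) can be checked numerically. The following is a minimal sketch using the singular value decomposition of the centered data; the function name `principal_components` and the synthetic data are illustrative assumptions.

```python
import numpy as np


def principal_components(X):
    """Return principal directions (rows) and explained variances via SVD."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = s ** 2 / (X.shape[0] - 1)
    return Vt, explained_variance


# Synthetic correlated data (illustrative only).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 1.0, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

V, var = principal_components(X)

# Coordinates of the centered data in the principal-component basis.
scores = (X - X.mean(axis=0)) @ V.T

# The covariance of the scores is diagonal: the new coordinates are
# uncorrelated, and the diagonal holds the explained variances.
cov = np.cov(scores, rowvar=False)
```

Here `V @ V.T` is the identity up to rounding, and `cov` matches `np.diag(var)`, with the variances sorted from largest to smallest.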
