Robust Principal Component Analysis: A Median of Means Approach

Principal Component Analysis (PCA) is a fundamental tool for data visualization, denoising, and dimensionality reduction. It is widely popular in Statistics, Machine Learning, Computer Vision, and related fields. However, PCA is well-known to fall prey to outliers and often fails to detect the true underlying low-dimensional structure within the dataset. Following the Median of Means (MoM) philosophy, recent supervised learning methods have shown great success in dealing with outlying observations without much compromise to their large sample theoretical properties. This paper proposes a PCA procedure based on the MoM principle. Called the \textbf{M}edian of \textbf{M}eans \textbf{P}rincipal \textbf{C}omponent \textbf{A}nalysis (MoMPCA), the proposed method is not only computationally appealing but also achieves optimal convergence rates under minimal assumptions. In particular, we explore the non-asymptotic error bounds of the obtained solution via the aid of the Rademacher complexities while granting absolutely no assumption on the outlying observations. The derived concentration results are not dependent on the dimension because the analysis is conducted in a separable Hilbert space, and the results only depend on the fourth moment of the underlying distribution in the corresponding norm. The proposal's efficacy is also thoroughly showcased through simulations and real data applications.

翻译：主成分分析（PCA）是数据可视化、去噪和降维的基础工具，广泛应用于统计学、机器学习、计算机视觉及相关领域。然而，PCA易受异常值影响，常无法检测数据集中真正的潜在低维结构。遵循均值中位数（MoM）原则，近期监督学习方法在处理异常观测值时取得了显著成功，且未过多牺牲其大样本理论性质。本文提出一种基于MoM原则的PCA过程，称为均值中位数主成分分析（MoMPCA）。该方法不仅计算简便，而且在最小假设下实现了最优收敛速率。特别地，我们借助Rademacher复杂度推导了所得解的非渐近误差界，且对异常观测值完全不设任何假设。由于分析在可分离希尔伯特空间中进行，导出的集中结果不依赖于维度，仅依赖于对应范数下底层分布的四阶矩。通过模拟实验和实际数据应用，该方法的有效性也得到了全面展示。

相关内容

PCA

关注 3

在统计中，主成分分析（PCA）是一种通过最大化每个维度的方差来将较高维度空间中的数据投影到较低维度空间中的方法。给定二维，三维或更高维空间中的点集合，可以将“最佳拟合”线定义为最小化从点到线的平均平方距离的线。可以从垂直于第一条直线的方向类似地选择下一条最佳拟合线。重复此过程会产生一个正交的基础，其中数据的不同单个维度是不相关的。这些基向量称为主成分。

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日