On the efficiency-loss free ordering-robustness of product-PCA

This article studies the robustness of the eigenvalue ordering, an important issue when estimating the leading eigen-subspace by principal component analysis (PCA). Yata and Aoshima (2010) proposed cross-data-matrix PCA (CDM-PCA) and showed that it has smaller bias than PCA in estimating eigenvalues. While CDM-PCA has the potential to estimate the leading eigen-subspace better than the usual PCA, its robustness is not well recognized. In this article, we first develop a more stable variant of CDM-PCA, which we call product-PCA (PPCA), that provides a more convenient formulation for theoretical investigation. Second, we prove that, in the presence of outliers, PPCA is more robust than PCA in maintaining the correct ordering of the leading eigenvalues. The robustness gain of PPCA comes from the random data partition; it does not rely on a data down-weighting scheme, as most robust statistical methods do. This enables us to establish the surprising finding that, when there are no outliers, PPCA and PCA share the same asymptotic distribution. That is, the robustness gain of PPCA in estimating the leading eigen-subspace incurs no efficiency loss relative to PCA. Simulation studies and a face data example are presented to show the merits of PPCA. In conclusion, PPCA has good potential to replace the usual PCA in real applications, whether or not outliers are present.
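The abstract does not state the PPCA estimator, so the following is only a rough sketch of the intuition that a random partition can protect the eigenvalue ordering. It assumes, purely for illustration, an estimator built from the product of two half-sample covariance matrices; the function names, the toy data, and the single planted outlier are all hypothetical and are not taken from the paper.

```python
import numpy as np

def leading_direction_pca(X):
    """Leading eigenvector of the full-sample covariance matrix."""
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    return vecs[:, -1]

def leading_direction_product(X, rng):
    """Leading eigenvector of S1 @ S2 from a random half-split.

    Illustrative assumption only: not necessarily the paper's PPCA.
    """
    idx = rng.permutation(X.shape[0])
    half = X.shape[0] // 2
    S1 = np.cov(X[idx[:half]], rowvar=False)
    S2 = np.cov(X[idx[half:]], rowvar=False)
    vals, vecs = np.linalg.eig(S1 @ S2)      # the product is not symmetric
    v = vecs[:, np.argmax(vals.real)].real
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
n = 2000
# Clean data: variance 4 along axis 0, variance 1 along axis 1.
X = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])
# Plant one gross outlier along the weaker direction (axis 1).
X[0] = [0.0, 90.0]

v_pca = leading_direction_pca(X)
v_prod = leading_direction_product(X, rng)
print("PCA leading direction:    ", np.abs(v_pca).round(2))
print("product leading direction:", np.abs(v_prod).round(2))
```

In this toy setup the outlier inflates the full-sample variance along axis 1 past that of axis 0, so ordinary PCA picks the wrong leading direction. The outlier lands in only one half of the split, so its effect on the product is diluted by the clean half and the original ordering survives.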

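In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space onto a lower-dimensional space by maximizing the variance along each retained dimension. Given a set of points in two-, three-, or higher-dimensional space, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line can be chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components.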
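The description of PCA above can be sketched in a few lines of NumPy. This is a minimal illustration using an eigendecomposition of the sample covariance matrix; the helper name `pca`, the mixing matrix `A`, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca(X, k):
    """Return the top-k variances, principal directions, and projected scores."""
    Xc = X - X.mean(axis=0)                  # center each column
    S = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix
    vals, vecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]       # indices of the k largest
    components = vecs[:, order]              # columns are principal directions
    return vals[order], components, Xc @ components

rng = np.random.default_rng(1)
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.3]])
X = rng.normal(size=(500, 3)) @ A            # correlated 3-D data

variances, components, scores = pca(X, k=2)
score_cov = np.cov(scores, rowvar=False)
print("variances along components:", variances)
print("covariance of projected data:\n", score_cov.round(6))
```

The covariance of the projected data comes out diagonal, which is the "uncorrelated dimensions" property, and the columns of `components` are orthonormal.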
