Principal component analysis (PCA) is a fundamental tool in multivariate statistics, yet its sensitivity to outliers and its limitations in distributed environments restrict its effectiveness in modern large-scale applications. To address these challenges, we introduce the $\phi$-PCA framework, which provides a unified formulation of robust and distributed PCA. Methods in the $\phi$-PCA class retain the asymptotic efficiency of standard PCA, and aggregating multiple local estimates through a suitable $\phi$ function enhances ordering-robustness, which leads to more accurate eigensubspace estimation under contamination. Notably, the harmonic mean PCA (HM-PCA), corresponding to the choice $\phi(u)=u^{-1}$, achieves optimal ordering-robustness and is recommended for practical use. Theoretical results further show that robustness increases with the number of partitions, a phenomenon seldom explored in the literature on robust or distributed PCA. Altogether, the partition-aggregation principle underlying $\phi$-PCA offers a general strategy for developing robust, efficiency-preserving methods for both robust and distributed data analysis.
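The partition-aggregation idea can be illustrated with a minimal sketch. The abstract does not spell out the estimator, so the form used here is an assumption: the data are split into $m$ partitions, each partition's sample covariance $S_k$ is transformed spectrally by $\phi$, the transformed matrices are averaged, and the average is mapped back with $\phi^{-1}$. The names `matrix_phi` and `phi_pca` and their parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def matrix_phi(S, phi):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, V = np.linalg.eigh(S)
    # Scale column j of V by phi(w_j), then recombine: V diag(phi(w)) V^T.
    return (V * phi(w)) @ V.T

def phi_pca(X, m, phi, phi_inv, r, seed=0):
    """Hypothetical phi-PCA sketch (assumed form, not the paper's definition).

    X       : (n, p) data matrix, rows are observations
    m       : number of random partitions
    phi     : scalar function applied to the eigenvalues of each local covariance
    phi_inv : inverse of phi, applied to the averaged matrix
    r       : number of leading components to return
    """
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), m)
    # Local step: sample covariance of each partition, transformed by phi.
    mean_phi = sum(matrix_phi(np.cov(X[part], rowvar=False), phi) for part in parts) / m
    # Aggregation step: map the average back with phi^{-1} (assumed form).
    S_agg = matrix_phi(mean_phi, phi_inv)
    w, V = np.linalg.eigh(S_agg)
    order = np.argsort(w)[::-1]          # eigenvalues in descending order
    return w[order][:r], V[:, order[:r]]

def hm_pca(X, m, r, seed=0):
    """Harmonic-mean choice phi(u) = 1/u, which is its own inverse."""
    inv = lambda u: 1.0 / u
    return phi_pca(X, m, inv, inv, r, seed)
```

Under this reading, `hm_pca(X, m=4, r=2)` returns the two leading eigenvalues and eigenvectors of the harmonic mean of four local covariance matrices. The harmonic-mean choice requires every local covariance to be invertible, so each partition needs more observations than variables. With `m=1` the sketch reduces to ordinary PCA on the sample covariance.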