Principal component analysis (PCA) is one of the most popular methods for dimension reduction. As large-scale data in federated ecosystems grow rapidly, traditional PCA is often inapplicable because of privacy protection requirements and its heavy computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size in the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data in which both the dimension $d$ and the sample size $n$ are ultra-large; it performs parallel computing along $d$ and distributed computing along $n$ simultaneously. Specifically, we use $L$ parallel copies of $p$-dimensional fast sketches to divide the computing burden along $d$, and aggregate the results distributively across the split samples. We present FADI under a general framework applicable to multiple statistical problems and establish comprehensive theoretical results under that framework. We show that FADI enjoys the same non-asymptotic error rate as traditional PCA when $Lp \ge d$. We also derive inferential results that characterize the asymptotic distribution of FADI and show a phase-transition phenomenon as $Lp$ increases. Extensive simulations show that FADI substantially outperforms existing methods in computational efficiency while preserving accuracy, and numerical experiments validate the distributional phase-transition phenomenon. We apply FADI to the 1000 Genomes data to study population structure.
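To make the two axes of the computation concrete, the following is a minimal NumPy sketch of one way the idea described in the abstract could be realized. It is an illustration under stated assumptions, not the authors' algorithm: the function name `sketched_distributed_pca`, the use of Gaussian sketches, the assumption that the data are already centered, and the final aggregation by stacking the per-sketch bases are all choices made here for illustration. Each of the $L$ sketches multiplies the covariance by a $d \times p$ random matrix, every sample split contributes its share of that product without ever forming a $d \times d$ matrix, and the $L$ resulting $K$-dimensional bases are then combined.

```python
import numpy as np


def sketched_distributed_pca(splits, K, p, L, seed=0):
    """Illustrative sketch-and-aggregate PCA (hypothetical helper, not the paper's code).

    splits : list of (n_s, d) arrays, one per site; rows are assumed centered.
    K      : number of principal components to return (requires K <= p).
    p      : sketch dimension; L : number of parallel sketches.
    Returns a (d, K) matrix with orthonormal columns.
    """
    rng = np.random.default_rng(seed)
    d = splits[0].shape[1]
    n = sum(X.shape[0] for X in splits)

    bases = []
    for _ in range(L):  # each pass is independent, so the L sketches could run in parallel
        omega = rng.standard_normal((d, p))  # d x p Gaussian sketch shared by all sites
        # Each site computes X_s^T (X_s @ omega), a d x p matrix, so no d x d
        # covariance is ever formed; the sum over sites is the distributed step.
        Y = sum(X.T @ (X @ omega) for X in splits) / n
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        bases.append(U[:, :K])  # top-K left singular vectors of this sketch

    # The top-K left singular vectors of the stacked bases equal the top-K
    # eigenvectors of the average projector (1/L) * sum_l V_l V_l^T.
    stacked = np.hstack(bases) / np.sqrt(L)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :K]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((600, 30))
    X[:, :2] *= 20.0  # two strong directions of variance
    V = sketched_distributed_pca(np.array_split(X, 4), K=2, p=10, L=5)
    print(V.shape)
```

In this sketch the only quantities a site shares are $d \times p$ products, which is the kind of reduced communication the abstract alludes to. Whether that is sufficient for a given privacy requirement is outside what this example addresses.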