FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data

Principal component analysis (PCA) is one of the most popular methods for dimension reduction. With large-scale data growing rapidly in federated ecosystems, traditional PCA is often not applicable because of privacy protection requirements and its heavy computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size in the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension $d$ and the sample size $n$ are ultra-large, by simultaneously performing parallel computing along $d$ and distributed computing along $n$. Specifically, we use $L$ parallel copies of $p$-dimensional fast sketches to divide the computing burden along $d$ and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under this framework. We show that FADI enjoys the same non-asymptotic error rate as traditional PCA when $Lp \ge d$. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as $Lp$ increases. We perform extensive simulations to show that FADI substantially outperforms existing methods in computational efficiency while preserving accuracy, and validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure.
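
The sketch-and-aggregate idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes Gaussian sketching matrices, row-wise data splits that are already centered, and aggregation by averaging the projection matrices of the $L$ sketched estimates. The function names `sketched_distributed_pca` and `subspace_distance` are hypothetical.

```python
import numpy as np


def sketched_distributed_pca(splits, K, p, L, seed=0):
    """Illustrative sketch-and-aggregate PCA (hypothetical helper, not the paper's code).

    splits: list of (n_s, d) arrays, one per site, assumed already centered.
    K: number of components; p: sketch dimension (p >= K); L: number of sketches.
    """
    rng = np.random.default_rng(seed)
    d = splits[0].shape[1]
    n = sum(X.shape[0] for X in splits)
    bases = []
    for _ in range(L):  # the L sketches are independent and could run in parallel
        omega = rng.standard_normal((d, p))  # Gaussian test matrix (assumed sketch type)
        # Each site forms X_s^T (X_s @ omega) locally; only d x p matrices are summed.
        Y = sum(X.T @ (X @ omega) for X in splits) / n
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        bases.append(U[:, :K])  # top-K left singular vectors of this sketch
    # The top-K left singular vectors of the stacked bases equal the top-K
    # eigenvectors of the averaged projection matrix (1/L) * sum_l V_l V_l^T.
    stacked = np.hstack(bases) / np.sqrt(L)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :K]


def subspace_distance(A, B):
    """Spectral norm of the difference between the two projection matrices."""
    return np.linalg.norm(A @ A.T - B @ B.T, 2)
```

In this sketch no site ever forms the full $d \times d$ covariance matrix. Each site contributes only $d \times p$ products, which is the kind of saving along $d$ that the abstract describes.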



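In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional one by maximizing the variance along each dimension. Given a set of points in two, three, or more dimensions, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line can be chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called the principal components.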
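
The construction above can be checked numerically. The following is a minimal sketch, assuming PCA is computed from the eigendecomposition of the sample covariance matrix; the helper name `pca` is hypothetical. It returns orthonormal directions ordered by decreasing variance, and the projected coordinates (scores) are uncorrelated.

```python
import numpy as np


def pca(X, k):
    """Return the top-k principal directions, the scores, and their variances."""
    Xc = X - X.mean(axis=0)                  # center each column
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)      # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    components = eigvecs[:, order]           # orthonormal basis vectors (columns)
    scores = Xc @ components                 # coordinates in the new basis
    return components, scores, eigvals[order]
```

The covariance matrix of `scores` is diagonal, which is the sense in which the individual dimensions are uncorrelated.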
