Oja's algorithm for streaming Principal Component Analysis (PCA) on $n$ datapoints in a $d$-dimensional space achieves the same $\sin^2$ error $O(r_\mathsf{eff}/n)$ as the offline algorithm in $O(d)$ space and $O(nd)$ time, using a single pass through the datapoints. Here $r_\mathsf{eff}$ is the effective rank (the ratio of the trace to the principal eigenvalue of the population covariance matrix $\Sigma$). Under this computational budget, we consider the problem of sparse PCA, where the principal eigenvector of $\Sigma$ is $s$-sparse and $r_\mathsf{eff}$ can be large. In this setting, to our knowledge, \textit{there are no known single-pass algorithms} that achieve the minimax error bound in $O(d)$ space and $O(nd)$ time without either requiring strong initialization conditions or assuming further structure (e.g., spiked) of the covariance matrix. We show that a simple single-pass procedure that thresholds the output of Oja's algorithm (the Oja vector) can achieve the minimax error bound under some regularity conditions in $O(d)$ space and $O(nd)$ time, as long as $r_\mathsf{eff}=O(n/\log n)$. We present a nontrivial and novel analysis of the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices onto a random initial vector. This is completely different from previous analyses of Oja's algorithm and matrix products, which have been carried out only when $r_\mathsf{eff}$ is bounded.
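The single-pass procedure described above (run Oja's algorithm, then threshold the resulting Oja vector) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's exact method: the constant learning rate `eta`, the hard-thresholding sparsity parameter `k`, and the function name `oja_threshold_pca` are all hypothetical choices for illustration (the paper's learning-rate schedule and thresholding rule may differ).

```python
import numpy as np

def oja_threshold_pca(X, eta=0.005, k=None, seed=0):
    """Single-pass Oja's algorithm with hard thresholding (sketch).

    X    : (n, d) array, one datapoint per row, processed in one pass.
    eta  : learning rate (assumed constant here for simplicity).
    k    : number of largest-magnitude entries to keep (sparsity level);
           if None, no thresholding is applied.
    Uses O(d) memory beyond the stream and O(nd) total time.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)        # random initial vector
    w /= np.linalg.norm(w)
    for x in X:                       # single pass over the stream
        w = w + eta * x * (x @ w)     # Oja update: w += eta * (x x^T) w
        w /= np.linalg.norm(w)        # renormalize each step
    if k is not None:
        # hard-threshold: zero out all but the k largest-magnitude entries
        small = np.argsort(np.abs(w))[:-k]
        w[small] = 0.0
        w /= np.linalg.norm(w)
    return w
```

On a simple synthetic stream with covariance $\Sigma = I + 9\,vv^\top$ for a $5$-sparse unit vector $v$ in $d=50$ dimensions, the thresholded Oja vector recovers the support and aligns with $v$ up to sign, illustrating the setting (sparse principal eigenvector, large $d$) the abstract targets.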