Thresholded Oja does Sparse PCA?

We consider the problem of Sparse Principal Component Analysis (PCA) when the ratio $d/n \rightarrow c > 0$. Optimal rates for sparse PCA have been studied extensively in the offline setting, where all the data is available for multiple passes. In contrast, when the population eigenvector is $s$-sparse, streaming algorithms with $O(d)$ storage and $O(nd)$ time complexity typically either require strong initialization conditions or have a suboptimal error. We show that a simple algorithm that thresholds and renormalizes the output of Oja's algorithm (the Oja vector) obtains an optimal error rate. This is surprising because, without thresholding, the Oja vector has a large error. Our analysis centers on bounding the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices onto a random initial vector. This is nontrivial and novel because previous analyses of Oja's algorithm and matrix products assume that the trace of the population covariance matrix is bounded, whereas in our setting this quantity can be as large as $n$.
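The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the thresholding rule or the learning rate, so the sketch assumes that the $s$ largest-magnitude entries are kept and that a fixed step size `eta` is supplied by the caller. The function names `oja_vector` and `threshold_and_renormalize` are hypothetical.

```python
import math
import random


def oja_vector(stream, d, eta, rng):
    """One streaming pass of Oja's update w <- w + eta * x * (x . w).

    Uses O(d) memory and O(d) work per sample. `eta` is an assumed fixed
    step size; the abstract does not state how it should be chosen.
    """
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random initial vector
    for x in stream:
        proj = sum(xi * wi for xi, wi in zip(x, w))
        w = [wi + eta * proj * xi for xi, wi in zip(x, w)]
        # Rescaling is for numerical stability only: it changes the length
        # of w, not the direction of the unnormalized Oja vector.
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]
    return w


def threshold_and_renormalize(w, s):
    """Keep the s largest-magnitude entries of w, zero the rest, rescale.

    Top-s hard thresholding is an assumption made for this sketch.
    """
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:s])
    v = [w[i] if i in keep else 0.0 for i in range(len(w))]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]
```

For example, `threshold_and_renormalize([3.0, 0.1, -4.0, 0.2], 2)` keeps the first and third entries and rescales them to unit length. In a streaming setting, `stream` can be any iterable that yields one length-`d` sample at a time, so the data never has to be held in memory at once.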
