Oja's algorithm for streaming Principal Component Analysis (PCA) on $n$ data points in a $d$-dimensional space achieves the same sin-squared error $O(r_{\mathsf{eff}}/n)$ as the offline algorithm, using $O(d)$ space, $O(nd)$ time, and a single pass through the data points. Here $r_{\mathsf{eff}}$ is the effective rank, i.e., the ratio of the trace to the principal eigenvalue of the population covariance matrix $\Sigma$. Under this computational budget, we consider the problem of sparse PCA, where the principal eigenvector of $\Sigma$ is $s$-sparse and $r_{\mathsf{eff}}$ can be large. In this setting, to our knowledge, \textit{there are no known single-pass algorithms} that achieve the minimax error bound in $O(d)$ space and $O(nd)$ time without either requiring strong initialization conditions or assuming further structure (e.g., a spiked model) of the covariance matrix. We show that a simple single-pass procedure that thresholds the output of Oja's algorithm (the Oja vector) can achieve the minimax error bound under some regularity conditions in $O(d)$ space and $O(nd)$ time. We present a nontrivial and novel analysis of the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices onto a random initial vector. This differs completely from previous analyses of Oja's algorithm and matrix products, which assume that $r_{\mathsf{eff}}$ is bounded.
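The procedure described above (a single pass of Oja's iteration followed by entrywise thresholding of the Oja vector) can be sketched as follows. This is a minimal illustration, not the paper's exact method: the constant step size `eta` and the threshold scale `tau` (here a heuristic `sqrt(log(d)/n)`) are assumptions for the sketch, not the schedule or threshold analyzed in the text.

```python
import numpy as np


def oja_threshold(X, eta=0.01, tau=None):
    """Single-pass Oja iteration, then hard-threshold the Oja vector.

    X   : (n, d) array; rows arrive as a stream of samples.
    eta : step size (a fixed rate chosen for illustration).
    tau : threshold; entries with magnitude below tau are zeroed.
          Defaults to a heuristic sqrt(log(d)/n) scale.
    """
    n, d = X.shape
    rng = np.random.default_rng(0)
    w = rng.standard_normal(d)          # random initial vector
    w /= np.linalg.norm(w)
    for x in X:                         # one pass, O(d) work per sample
        w += eta * x * (x @ w)          # Oja update: w <- w + eta * x x^T w
        w /= np.linalg.norm(w)          # keep the iterate on the unit sphere
    if tau is None:
        tau = np.sqrt(np.log(d) / n)    # illustrative threshold scale
    w[np.abs(w) < tau] = 0.0            # zero out small entries
    nrm = np.linalg.norm(w)
    return w / nrm if nrm > 0 else w
```

The whole pipeline stores only the current $d$-dimensional iterate, matching the $O(d)$-space, $O(nd)$-time budget; thresholding at the end is what exploits the $s$-sparsity of the principal eigenvector.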