We consider the problem of sparse Principal Component Analysis (PCA) in the regime where the ratio of dimension to sample size satisfies $d/n \rightarrow c > 0$. Optimal rates for sparse PCA have been studied extensively in the offline setting, where all the data is available for multiple passes. In contrast, when the population eigenvector is $s$-sparse, streaming algorithms with $O(d)$ storage and $O(nd)$ time complexity typically either require strong initialization conditions or have a suboptimal error. We show that a simple algorithm that thresholds and renormalizes the output of Oja's algorithm (the Oja vector) obtains a near-optimal error rate. This is surprising because, without thresholding, the Oja vector has a large error. Our analysis centers on bounding the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices onto a random initial vector. This is nontrivial and novel because previous analyses of Oja's algorithm and matrix products assume that the trace of the population covariance matrix is bounded, whereas in our setting this quantity can be as large as $n$.
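The procedure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the thresholding rule (here, keeping the `k` largest-magnitude entries), the step size `eta`, the spiked-covariance data model in `sample_spiked`, and all function names. The Oja update uses $O(d)$ memory and one pass over the stream; normalizing at every step yields the same direction as normalizing the unnormalized product once at the end, but is numerically safer.

```python
import numpy as np


def sample_spiked(n, d, s, lam, rng):
    """Hypothetical test data: rows x_i ~ N(0, I_d + lam * v v^T), v s-sparse."""
    v = np.zeros(d)
    v[:s] = 1.0 / np.sqrt(s)          # unit-norm, s-sparse leading eigenvector
    g = rng.standard_normal((n, 1))   # spike coefficients, one per sample
    Z = rng.standard_normal((n, d))   # isotropic noise
    return np.sqrt(lam) * g * v + Z, v


def oja_vector(X, eta, rng):
    """One streaming pass of Oja's update from a random unit initial vector."""
    d = X.shape[1]
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for x in X:                        # each sample is seen exactly once
        w += eta * x * (x @ w)         # w <- (I + eta * x x^T) w
        w /= np.linalg.norm(w)         # rescale; the direction is unchanged
    return w


def threshold_and_renormalize(w, k):
    """Keep the k largest-magnitude entries (assumed rule), then renormalize."""
    keep = np.argpartition(np.abs(w), -k)[-k:]
    out = np.zeros_like(w)
    out[keep] = w[keep]
    return out / np.linalg.norm(out)


def sin2_error(u, v):
    """Squared sine of the angle between two unit vectors."""
    return 1.0 - float(u @ v) ** 2
```

A typical use would draw `X, v = sample_spiked(n, d, s, lam, rng)` with `d` comparable to `n`, compute `w = oja_vector(X, eta, rng)`, and compare `sin2_error(w, v)` with `sin2_error(threshold_and_renormalize(w, s), v)`. Passing the true sparsity `s` as `k` is itself an assumption; in practice it would have to be known or estimated.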