We consider sparse principal component analysis (PCA) in the regime where the dimension $d$ and the sample size $n$ satisfy $d/n \rightarrow c > 0$. Optimal rates for sparse PCA have been studied extensively in the offline setting, where all the data is available for multiple passes. In contrast, when the population eigenvector is $s$-sparse, streaming algorithms with $O(d)$ storage and $O(nd)$ time complexity typically either require strong initialization conditions or have a suboptimal error. We show that a simple algorithm that thresholds and renormalizes the output of Oja's algorithm (the Oja vector) obtains an optimal error rate. This is surprising because, without thresholding, the Oja vector has a large error. Our analysis centers on bounding the entries of the unnormalized Oja vector, which involves the projection of a product of independent random matrices onto a random initial vector. This is nontrivial and novel because previous analyses of Oja's algorithm and matrix products assume that the trace of the population covariance matrix is bounded, whereas in our setting this quantity can be as large as $n$.
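The threshold-and-renormalize procedure described above can be illustrated with a minimal NumPy sketch. Several details here are assumptions rather than the paper's specification: the function names `oja_vector` and `threshold_and_renormalize` are hypothetical, the step size `eta` is a free parameter, and the thresholding rule (keep the `k` largest-magnitude entries) is one plausible choice among several. The per-step normalization is included only for numerical stability; it rescales the iterate by a positive scalar and so leaves the final direction unchanged.

```python
import numpy as np


def oja_vector(X, eta, rng):
    """Single pass of Oja's update over the rows of X, using O(d) memory.

    X   : (n, d) array whose rows are treated as a data stream.
    eta : step size (an assumed tuning parameter).
    rng : numpy.random.Generator used for the random initial vector.
    """
    n, d = X.shape
    w = rng.standard_normal(d)      # random initial vector
    w /= np.linalg.norm(w)
    for x in X:                     # each sample is touched once: O(nd) time
        w += eta * x * (x @ w)      # w <- (I + eta * x x^T) w
        w /= np.linalg.norm(w)      # positive rescaling, direction unchanged
    return w


def threshold_and_renormalize(w, k):
    """Keep the k largest-magnitude entries of w, zero the rest, renormalize.

    Top-k truncation is an assumed thresholding rule for illustration.
    """
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-k:]   # indices of the k largest |w_i|
    out[keep] = w[keep]
    return out / np.linalg.norm(out)
```

A typical call would be `threshold_and_renormalize(oja_vector(X, eta, np.random.default_rng(0)), k)`, with `k` set to the assumed sparsity level $s$ when it is known.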