Since its inception in 1982, Oja's algorithm has become an established method for streaming principal component analysis (PCA). We study the problem of streaming PCA, where the data points are sampled from an irreducible, aperiodic, and reversible Markov chain. Our goal is to estimate the top eigenvector of the unknown covariance matrix of the stationary distribution. This setting is relevant in scenarios where data can only be sampled from a Markov chain Monte Carlo (MCMC) type algorithm, and the objective is to perform inference on parameters of the stationary distribution. Most convergence guarantees for Oja's algorithm in the literature assume that the data points are sampled IID. For data streams with Markovian dependence, one typically downsamples the data to get a "nearly" independent data stream. In this paper, we obtain the first sharp rate for Oja's algorithm on the entire data, removing the logarithmic dependence on the sample size, $n$, that results from throwing data away in downsampling strategies.
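To make the setting concrete, the following is a minimal sketch of Oja's update run on every sample of a Markovian stream, with no downsampling. Several choices here are illustrative assumptions and do not come from the paper: the stream is a Gaussian AR(1) chain (one convenient example of a reversible chain whose stationary covariance is known), the step size is a fixed constant rather than a tuned schedule, and the names `oja_top_eigenvector`, `ar1_stream`, `cov_sqrt`, and `rho` are hypothetical.

```python
import numpy as np


def oja_top_eigenvector(stream, dim, step_size, seed=0):
    """Estimate the top eigenvector of the stream's covariance with Oja's update."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for x in stream:
        # Oja's rule: step along x * (x^T w), then project back to the unit sphere.
        w += step_size * x * (x @ w)
        w /= np.linalg.norm(w)
    return w


def ar1_stream(cov_sqrt, rho, n, seed=1):
    """Yield n samples of x_{t+1} = rho * x_t + sqrt(1 - rho^2) * noise.

    Illustrative assumption: the noise has covariance cov_sqrt @ cov_sqrt.T,
    so the stationary distribution is N(0, cov_sqrt @ cov_sqrt.T).
    """
    rng = np.random.default_rng(seed)
    dim = cov_sqrt.shape[0]
    x = cov_sqrt @ rng.standard_normal(dim)  # start from the stationary law
    for _ in range(n):
        yield x
        noise = cov_sqrt @ rng.standard_normal(dim)
        x = rho * x + np.sqrt(1.0 - rho**2) * noise


# Diagonal covariance, so the true top eigenvector is the first basis vector.
eigvals = np.array([4.0, 1.0, 0.5, 0.25, 0.1])
cov_sqrt = np.diag(np.sqrt(eigvals))
dim = eigvals.size

w_hat = oja_top_eigenvector(
    ar1_stream(cov_sqrt, rho=0.5, n=20000), dim, step_size=0.005
)

# sin^2 of the angle between the estimate and the true top eigenvector e_1.
sin2_error = 1.0 - w_hat[0] ** 2
print(f"sin^2 error: {sin2_error:.4f}")
```

The error is reported as the squared sine of the angle to the true eigenvector, which is insensitive to the sign of the estimate. Raising `rho` toward 1 strengthens the dependence between consecutive samples, which is the regime where a downsampling strategy would discard the most data.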