We analyze classical stochastic projected gradient methods under a general dependent data sampling scheme for constrained smooth nonconvex optimization. We establish a worst-case convergence rate of $\tilde{O}(t^{-1/4})$ and a complexity of $\tilde{O}(\varepsilon^{-4})$ for reaching an $\varepsilon$-near stationary point, measured by the norm of the gradient of the Moreau envelope and by the norm of the gradient mapping. While classical convergence guarantees require i.i.d. data sampling from the target distribution, we only require a mild mixing condition on the conditional distribution, which holds for a wide class of Markov chain sampling algorithms. This improves the existing complexity for constrained smooth nonconvex optimization with dependent data from $\tilde{O}(\varepsilon^{-8})$ to $\tilde{O}(\varepsilon^{-4})$, with a significantly simpler analysis. We illustrate the generality of our approach by deriving convergence results with dependent data for stochastic proximal gradient methods, the adaptive stochastic gradient algorithm AdaGrad, and the stochastic gradient algorithm with heavy-ball momentum. As an application, we obtain the first online nonnegative matrix factorization algorithms for dependent data based on stochastic projected gradient methods with adaptive step sizes and an optimal rate of convergence.
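To make the setting concrete, the following is a minimal sketch of projected stochastic gradient descent driven by Markov chain samples rather than i.i.d. draws. It is not the authors' implementation: the box constraint, the toy nonconvex loss $\tfrac12\|x^2 - \xi\|^2$, the two-state chain `P`, the step-size constant `c`, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^d (the assumed constraint set)."""
    return np.clip(x, lo, hi)


def markov_samples(means, P, n_steps, rng, noise=0.05):
    """Yield dependent samples: a finite-state Markov chain with transition
    matrix P selects which row of `means` is emitted, plus Gaussian noise."""
    state = 0
    for _ in range(n_steps):
        state = rng.choice(len(P), p=P[state])
        yield means[state] + noise * rng.standard_normal(means.shape[1])


def stochastic_grad(x, xi):
    """Gradient in x of the toy nonconvex loss 0.5 * ||x**2 - xi||^2."""
    return 2.0 * x * (x**2 - xi)


def projected_sgd(x0, samples, c=0.2):
    """Projected SGD with step size c / sqrt(t + 1); returns the last iterate."""
    x = project_box(np.asarray(x0, dtype=float))
    for t, xi in enumerate(samples):
        step = c / np.sqrt(t + 1)
        x = project_box(x - step * stochastic_grad(x, xi))
    return x


def gradient_mapping_norm(x, full_grad, gamma=0.1):
    """Norm of the gradient mapping (x - proj(x - gamma * grad F(x))) / gamma."""
    return np.linalg.norm((x - project_box(x - gamma * full_grad(x))) / gamma)


# Hypothetical two-state chain; its stationary distribution is (0.5, 0.5).
P = np.array([[0.9, 0.1], [0.1, 0.9]])
means = np.array([[0.2, 0.8], [0.6, 0.4]])
target_mean = 0.5 * means[0] + 0.5 * means[1]  # mean under the stationary law

rng = np.random.default_rng(0)
x_final = projected_sgd([0.5, 0.5], markov_samples(means, P, 5000, rng))

# Gradient of the expected loss under the stationary distribution.
full_grad = lambda x: 2.0 * x * (x**2 - target_mean)
print("last iterate:", x_final)
print("gradient mapping norm:", gradient_mapping_norm(x_final, full_grad))
```

Consecutive samples here are correlated because the chain tends to stay in its current state, so each stochastic gradient is biased conditionally on the recent past; the gradient mapping norm, evaluated with the gradient under the stationary distribution, is one of the two stationarity measures named in the abstract.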