Convergence Analysis of Stochastic Gradient Descent with MCMC Estimators

Understanding stochastic gradient descent (SGD) and its variants is essential for machine learning. However, most previous analyses are conducted under amenable conditions, such as unbiased gradient estimators and bounded objective functions, which do not encompass many sophisticated applications such as variational Monte Carlo, entropy-regularized reinforcement learning, and variational inference. In this paper, we consider the SGD algorithm that employs a Markov Chain Monte Carlo (MCMC) estimator to compute the gradient, called MCMC-SGD. MCMC reduces the sampling complexity significantly, but in practice it yields a biased estimator that is only asymptotically convergent. Moreover, admitting a general class of unbounded functions makes the MCMC sampling error much more difficult to analyze. We therefore assume that the function is sub-exponential and use the Bernstein inequality for non-stationary Markov chains to derive error bounds for the MCMC estimator. Consequently, MCMC-SGD is proven to have a first-order convergence rate of $O(\log K/\sqrt{n K})$ with $K$ iterations and a sample size $n$. This partially explains how MCMC influences the behavior of SGD. Furthermore, we verify the correlated negative curvature condition under reasonable assumptions. It is shown that, with high probability, MCMC-SGD escapes from saddle points and reaches $(\epsilon,\epsilon^{1/4})$ approximate second-order stationary points or $\epsilon^{1/2}$-variance points in at most $O(\epsilon^{-11/2}\log^{2}(1/\epsilon))$ steps. Our analysis unveils the convergence pattern of MCMC-SGD across a broad class of stochastic optimization problems and interprets the convergence phenomena observed in practical applications.
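To make the setting concrete, the following is a minimal illustrative sketch of an MCMC-SGD loop on a toy problem. It is not the algorithm or experiment from the paper. Everything in it is an assumption chosen for illustration: the objective $L(\theta)=\mathbb{E}_{x\sim N(\theta,1)}[x^2]=\theta^2+1$ (an unbounded, sub-exponential integrand), the random-walk Metropolis sampler, the score-function gradient estimator with a sample-mean baseline, the $1/\sqrt{k}$ step size, and the function names `metropolis_chain`, `mcmc_gradient`, and `mcmc_sgd`.

```python
import math
import random


def metropolis_chain(theta, x0, n, rng, step=1.0):
    """Run n random-walk Metropolis steps targeting N(theta, 1), starting at x0."""
    samples = []
    x = x0
    for _ in range(n):
        proposal = x + rng.gauss(0.0, step)
        # Log density ratio of N(theta, 1) at the proposal vs. the current state.
        log_ratio = 0.5 * ((x - theta) ** 2 - (proposal - theta) ** 2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples


def mcmc_gradient(theta, samples):
    """Score-function estimate of d/dtheta E_{x~N(theta,1)}[x^2].

    Uses grad log p(x; theta) = x - theta and a sample-mean baseline.
    The estimate is biased because the samples are correlated and the
    chain has not reached stationarity.
    """
    f = [x * x for x in samples]
    baseline = sum(f) / len(f)
    total = sum((fx - baseline) * (x - theta) for fx, x in zip(f, samples))
    return total / len(samples)


def mcmc_sgd(theta0=3.0, iterations=2000, n=50, lr=0.05, seed=0):
    """Toy MCMC-SGD: K = iterations outer steps, n chain samples per step."""
    rng = random.Random(seed)
    theta, x = theta0, theta0
    for k in range(iterations):
        samples = metropolis_chain(theta, x, n, rng)
        x = samples[-1]  # warm start: the chain is continued, not restarted
        theta -= lr / math.sqrt(k + 1) * mcmc_gradient(theta, samples)
    return theta


if __name__ == "__main__":
    print(mcmc_sgd())  # should end up near the minimizer theta = 0
```

In this sketch the chain is carried over between iterations while $\theta$ changes, so the samples never come from an exactly stationary chain. This is the kind of biased, correlated gradient estimate that the analysis above addresses.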
