We address the problem of solving strongly convex and smooth minimization problems using the stochastic gradient descent (SGD) algorithm with a constant step size. Previous works suggested combining the Polyak-Ruppert averaging procedure with the Richardson-Romberg extrapolation technique to reduce the asymptotic bias of SGD at the expense of a mild increase in variance. We significantly extend previous results by providing an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations $n$. More precisely, we show that the mean-squared error can be decomposed into the sum of two terms: a leading one of order $\mathcal{O}(n^{-1/2})$ with explicit dependence on a minimax-optimal asymptotic covariance matrix, and a second-order term of order $\mathcal{O}(n^{-3/4})$, in which the power $3/4$ cannot be improved in general. We also extend this result to bounds on the $p$-th moment, keeping the optimal scaling of the remainder terms with respect to $n$. Our analysis relies on the properties of the SGD iterates viewed as a time-homogeneous Markov chain. In particular, we establish that this chain is geometrically ergodic with respect to a suitably defined weighted Wasserstein semimetric.
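For concreteness, the estimator in question combines two Polyak-Ruppert averages computed with step sizes $\gamma$ and $2\gamma$: writing $\bar{\theta}_n^{(\gamma)}$ for the average of the first $n$ SGD iterates with step size $\gamma$, the Richardson-Romberg extrapolated estimator is $\theta_n^{\mathrm{RR}} = 2\bar{\theta}_n^{(\gamma)} - \bar{\theta}_n^{(2\gamma)}$, which cancels the leading $\mathcal{O}(\gamma)$ term of the asymptotic bias. The sketch below is a minimal illustration of this construction on a toy one-dimensional strongly convex problem; the objective, step size, noise level, and all identifiers are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strongly convex, smooth objective with a non-quadratic term, so that
# constant-step SGD has a genuine O(gamma) asymptotic bias:
#   f(theta) = theta^2/2 + theta^4/4 - theta,
#   grad f(theta) = theta + theta^3 - 1,   f''(theta) = 1 + 3*theta^2 >= 1.
# The minimizer theta* is the real root of t^3 + t - 1 = 0 (about 0.6823).
theta_star = np.roots([1.0, 0.0, 1.0, -1.0]).real.max()

def stoch_grad(theta, sigma=1.0):
    """Unbiased stochastic gradient: grad f(theta) plus Gaussian noise."""
    return theta + theta**3 - 1.0 + sigma * rng.standard_normal()

def pr_sgd(gamma, n, theta0=0.0):
    """Constant-step SGD with Polyak-Ruppert averaging of the iterates."""
    theta, avg = theta0, 0.0
    for k in range(n):
        theta -= gamma * stoch_grad(theta)
        avg += (theta - avg) / (k + 1)  # running mean of theta_1, ..., theta_n
    return avg

n, gamma = 200_000, 0.05
bar_g = pr_sgd(gamma, n)        # averaged iterate, step size gamma
bar_2g = pr_sgd(2 * gamma, n)   # averaged iterate, step size 2 * gamma

# Richardson-Romberg extrapolation: the stationary bias of the averaged
# iterate is b*gamma + O(gamma^2), so this combination cancels the leading
# term at the cost of a mild increase in variance.
theta_rr = 2.0 * bar_g - bar_2g

print(f"|bar_theta_gamma - theta*| = {abs(bar_g - theta_star):.5f}")
print(f"|theta_RR        - theta*| = {abs(theta_rr - theta_star):.5f}")
```

With these (assumed) parameters, the error of the plain averaged iterate is dominated by its $\mathcal{O}(\gamma)$ bias, while the extrapolated estimate's error is driven by the fluctuation term, matching the bias-reduction role the abstract describes.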