We consider the optimization of a smooth and strongly convex objective using constant step-size stochastic gradient descent (SGD) and study its properties through the prism of Markov chains. We show that, for unbiased gradient estimates with mildly controlled variance, the iteration converges to an invariant distribution in total variation distance. We also establish this convergence in Wasserstein-2 distance under an assumption on the gradient noise distribution that is weaker than in previous work. Thanks to the invariance property of the limit distribution, our analysis shows that this distribution inherits sub-Gaussian or sub-exponential concentration properties when the gradient satisfies them. This allows the derivation of high-confidence bounds for the final estimate. Finally, in the linear case and under the same conditions, we obtain a dimension-free deviation bound for the Polyak-Ruppert average of a tail sequence. All our results are non-asymptotic, and their consequences are discussed through a few applications.
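To make the Markov chain viewpoint concrete, the following minimal sketch simulates constant step-size SGD on a toy problem. It assumes a one-dimensional quadratic objective f(θ) = ½·a·(θ − θ*)² with additive Gaussian gradient noise, a setting chosen here only for illustration and not taken from the analysis itself. In this toy case the iterates form an AR(1) chain whose invariant distribution is Gaussian with mean θ* and variance γσ² / (a(2 − γa)), so the empirical spread of the chain and its Polyak-Ruppert tail average can be compared with a closed form. All function names and parameter values (`sgd_chain`, `tail_average`, `stationary_variance`, the step size, the burn-in) are hypothetical.

```python
import random
import statistics


def sgd_chain(a, theta_star, gamma, sigma, n_steps, theta0=0.0, seed=0):
    """Constant step-size SGD on f(theta) = 0.5 * a * (theta - theta_star)**2.

    The gradient estimate is the exact gradient plus N(0, sigma^2) noise,
    which is an illustrative assumption (unbiased, sub-Gaussian noise).
    Returns the list of all iterates theta_1, ..., theta_n.
    """
    if not 0.0 < gamma * a < 2.0:
        raise ValueError("need 0 < gamma * a < 2 for the chain to be stable")
    rng = random.Random(seed)
    theta = theta0
    iterates = []
    for _ in range(n_steps):
        noisy_grad = a * (theta - theta_star) + rng.gauss(0.0, sigma)
        theta -= gamma * noisy_grad  # constant step size: no decay schedule
        iterates.append(theta)
    return iterates


def tail_average(iterates, burn_in):
    """Polyak-Ruppert average over the tail sequence after `burn_in` steps."""
    return statistics.fmean(iterates[burn_in:])


def stationary_variance(a, gamma, sigma):
    """Variance of the invariant distribution in this toy Gaussian model.

    Solves v = (1 - gamma * a)**2 * v + gamma**2 * sigma**2 for v.
    """
    return gamma * sigma**2 / (a * (2.0 - gamma * a))


if __name__ == "__main__":
    chain = sgd_chain(a=1.0, theta_star=2.0, gamma=0.1, sigma=1.0, n_steps=100_000)
    tail = chain[1_000:]
    print("tail average:        ", tail_average(chain, 1_000))
    print("empirical variance:  ", statistics.pvariance(tail))
    print("stationary variance: ", stationary_variance(1.0, 0.1, 1.0))
```

In this sketch the last iterate keeps fluctuating around θ* at a scale set by the step size, while the tail average concentrates much more tightly. This is only a qualitative picture of the distinction the abstract draws between bounds for the final estimate and for the averaged sequence.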