Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted growing interest, because the last iterate performs well in practice yet remains poorly understood in theory. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all existing works are limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still attain the optimal convergence rate without these two restrictive assumptions. Beyond this important question, many theoretical problems remain unanswered. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only a few results have been developed for smooth optimization. Additionally, the existing results are all limited to non-composite objectives and the standard Euclidean norm. It remains unclear whether last-iterate convergence can be provably extended to the broader settings of composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove convergence rates, both in expectation and in high probability, that accommodates general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain last-iterate convergence under heavy-tailed noise.

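To make the setting concrete, the following is a minimal sketch of what "last iterate" means for SGD on a Lipschitz convex function over an unbounded domain with unbounded noise. It is not the method analyzed in this work. The objective $f(x) = |x|$, the step size $\eta/\sqrt{t}$, the Gaussian noise model, and the function name `sgd_last_iterate` are all illustrative assumptions.

```python
import math
import random


def sgd_last_iterate(T, x0=5.0, eta=1.0, noise_std=1.0, seed=0):
    """Run T steps of SGD on f(x) = |x| (an assumed toy objective).

    Returns a pair (last_iterate, average_iterate), where the average is
    taken over the T points at which a stochastic subgradient was queried.
    """
    rng = random.Random(seed)
    x = x0
    running_sum = 0.0
    for t in range(1, T + 1):
        running_sum += x
        # A subgradient of |x| (we pick 0 at x = 0).
        subgrad = (x > 0) - (x < 0)
        # Gaussian noise is unbounded, so no almost-sure bound holds.
        g = subgrad + rng.gauss(0.0, noise_std)
        # No projection step: the domain is all of R, hence not compact.
        x -= (eta / math.sqrt(t)) * g
    return x, running_sum / T


if __name__ == "__main__":
    last, avg = sgd_last_iterate(T=10_000)
    # f(x) = |x| has minimum value 0, so |x| is the suboptimality gap.
    print(f"f(last iterate)    = {abs(last):.4f}")
    print(f"f(average iterate) = {abs(avg):.4f}")
```

Classical analyses bound the suboptimality of the averaged point, whereas the rates discussed above concern the final point `last`, which is what practitioners typically return.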