Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted growing interest because it performs well in practice yet remains poorly understood in theory. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all existing works are either limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate without these two restrictive assumptions. Beyond this important question, many theoretical problems remain open. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only a few results have been developed for smooth optimization. Additionally, the existing results are all limited to non-composite objectives and the standard Euclidean norm. It remains unclear whether last-iterate convergence can be provably extended to the wider settings of composite optimization and non-Euclidean norms. In this work, to address these issues, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove convergence rates, both in expectation and in high probability, that accommodates general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain last-iterate convergence under heavy-tailed and sub-Weibull noise.
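To make the object of study concrete, the following is a minimal illustrative sketch, not the method analyzed in this work. It runs plain SGD on an unconstrained domain with unbounded (Gaussian) gradient noise and returns both the last iterate and the running average of the iterates. The objective $f(x) = |x|$, the step size $\eta_t = 1/\sqrt{t}$, the noise model, and the function name `sgd_last_and_average` are all assumptions chosen for illustration.

```python
import math
import random


def sgd_last_and_average(T, x0=5.0, noise_std=1.0, seed=0):
    """Run T steps of SGD on the assumed objective f(x) = |x|.

    The domain is all of R (not compact) and the gradient noise is
    Gaussian (not almost surely bounded). Returns a pair
    (last_iterate, average_iterate).
    """
    rng = random.Random(seed)
    x = x0
    running_sum = 0.0
    for t in range(1, T + 1):
        running_sum += x  # accumulate the iterate used at step t
        # A subgradient of |x|: sign(x), with 0 chosen at x = 0.
        g = (x > 0) - (x < 0)
        # Stochastic subgradient: true subgradient plus zero-mean noise.
        noisy_g = g + rng.gauss(0.0, noise_std)
        eta = 1.0 / math.sqrt(t)  # assumed step-size schedule
        x = x - eta * noisy_g
    # x is the last iterate; running_sum / T averages the T earlier iterates.
    return x, running_sum / T


if __name__ == "__main__":
    last, avg = sgd_last_and_average(T=10_000)
    print(f"f(last iterate)    = {abs(last):.4f}")
    print(f"f(average iterate) = {abs(avg):.4f}")
```

Classical analyses bound the suboptimality of the averaged iterate, whereas the question raised above concerns the value returned as `last`, which is what practitioners typically use. A single run of this sketch shows one sample path only and says nothing about rates.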
