Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

From arXiv. The preliminary version was accepted at ICLR 2024. This extended version was finished in November 2023 and revised in March 2024 to fix typos.

In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted interest because of its good performance in practice and the lack of a matching theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all existing works are either limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate without these two restrictive assumptions. Besides this important question, many theoretical problems remain unanswered. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only a few results have been developed for smooth optimization. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm. It remains unclear whether last-iterate convergence can be provably extended to the wider settings of composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove convergence rates, both in expectation and in high probability, that accommodates general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain last-iterate convergence under heavy-tailed noise.
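As a rough illustration of the setting described above, the sketch below runs plain SGD on a Lipschitz convex function over an unbounded domain with unbounded (Gaussian) gradient noise, and reports the objective value at both the last iterate and the averaged iterate. It is not the paper's algorithm or experiment. The objective $f(x) = |x|$, the step size $1/\sqrt{t}$, the noise level, and the function name `sgd_last_iterate` are all assumptions chosen for illustration.

```python
import numpy as np


def sgd_last_iterate(T, x0=5.0, noise_std=1.0, seed=0):
    """Run SGD on f(x) = |x| with noisy subgradients.

    Returns the last iterate and the average of all iterates.
    Every setting here is an illustrative assumption, not the paper's setup.
    """
    rng = np.random.default_rng(seed)
    x = x0
    running_sum = 0.0
    for t in range(1, T + 1):
        subgrad = np.sign(x)  # a subgradient of |x| (0 at x = 0)
        # Gaussian noise is unbounded, so it is not almost surely bounded.
        g = subgrad + noise_std * rng.standard_normal()
        eta = 1.0 / np.sqrt(t)  # assumed step size schedule
        x = x - eta * g  # no projection: the domain is all of R
        running_sum += x
    return x, running_sum / T


last, avg = sgd_last_iterate(T=10_000)
print(f"f(last iterate)    = {abs(last):.4f}")
print(f"f(average iterate) = {abs(avg):.4f}")
```

Comparing the two printed values across several seeds and horizons gives an informal feel for how the final iterate behaves relative to the average. A single run like this says nothing about the high-probability rates discussed in the abstract.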
