Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent

Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied to different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and obtain time-uniform stability bounds (i.e., bounds that do not increase with the number of iterations) for strongly convex losses and for non-convex losses with additive noise, recovering results similar to the prior art or extending them to more general cases with a single proof technique. Our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds, which might not be achieved for convex or non-convex losses unless additional noise is injected into the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, are novel.
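As a rough numerical illustration of what a time-uniform bound means in the strongly convex case, the sketch below runs SGD on two datasets that differ in a single point, using the same sampled indices for both runs (a synchronous coupling). The expected distance between coupled iterates upper-bounds the 1-Wasserstein distance between their laws. This is not the proof technique of the paper; the loss (ridge-regularized least squares), the synthetic data, the function name `coupled_sgd_distances`, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def coupled_sgd_distances(n=100, d=5, steps=2000, eta=0.05, lam=0.1, seed=0):
    """Run SGD on two neighboring datasets with shared sampling randomness.

    Illustrative loss (an assumption, not taken from the paper):
        f(theta; x, y) = 0.5 * (x @ theta - y)**2 + 0.5 * lam * ||theta||**2
    Returns the distances ||theta_k - theta'_k|| for k = 0, ..., steps.
    """
    rng = np.random.default_rng(seed)

    # Synthetic data with unit-norm features and labels clipped to [-1, 1],
    # so every per-sample gradient is well behaved.
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.clip(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n), -1.0, 1.0)

    # Neighboring dataset: identical except for the data point at index 0.
    X2, y2 = X.copy(), y.copy()
    x_new = rng.normal(size=d)
    X2[0] = x_new / np.linalg.norm(x_new)
    y2[0] = -y[0]

    theta = np.zeros(d)
    theta2 = np.zeros(d)
    dists = np.empty(steps + 1)
    dists[0] = 0.0

    for k in range(steps):
        # Synchronous coupling: both chains use the same sampled index.
        i = rng.integers(n)
        theta = theta - eta * ((X[i] @ theta - y[i]) * X[i] + lam * theta)
        theta2 = theta2 - eta * ((X2[i] @ theta2 - y2[i]) * X2[i] + lam * theta2)
        dists[k + 1] = np.linalg.norm(theta - theta2)

    return dists


if __name__ == "__main__":
    dists = coupled_sgd_distances()
    half = len(dists) // 2
    print("max distance, first half :", dists[:half].max())
    print("max distance, second half:", dists[half:].max())
```

Under these assumptions (unit-norm features, labels in [-1, 1], and eta * (1 + lam) <= 1), each iterate stays within a ball of radius 1/lam, so the coupled distance can never exceed 2/lam regardless of the number of steps. That crude cap only shows boundedness in time; it says nothing about the sharper dependence on the dataset size that stability bounds typically capture.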
