Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent

Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied to different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and obtain time-uniform stability bounds (i.e., bounds that do not increase with the number of iterations) for strongly convex losses and for non-convex losses with additive noise, where we recover results similar to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds, which might not be achieved for convex or non-convex losses unless additional noise is injected into the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, are novel.
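
As a minimal numerical sketch of what a time-uniform stability bound means (an illustration only, not the construction used in this work), the Python snippet below runs SGD on two datasets that differ in a single sample. It assumes an L2-regularized least-squares loss as an example of a strongly convex loss. Both runs draw the same sample indices (a synchronous coupling), so the expected distance between the two iterates under this coupling upper-bounds their Wasserstein-1 distance. The function names, step size, and regularization strength are hypothetical choices made for the example.

```python
import numpy as np


def make_neighbors(n=50, d=5, seed=1):
    """Build two datasets that differ only in their first sample."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm features
    y = rng.normal(size=n)
    X2, y2 = X.copy(), y.copy()
    x_new = rng.normal(size=d)
    X2[0] = x_new / np.linalg.norm(x_new)  # replace one sample
    y2[0] = rng.normal()
    return X, y, X2, y2


def coupled_sgd_gaps(X, y, X2, y2, eta=0.05, lam=0.1, steps=2000, seed=0):
    """Run SGD on both datasets with shared sample indices.

    Returns the Euclidean distance between the two iterates at every step.
    The per-sample loss is 0.5 * (x @ w - y)**2 + 0.5 * lam * ||w||**2,
    an assumed example of a strongly convex loss.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w2 = np.zeros(d)
    gaps = np.empty(steps)
    for k in range(steps):
        i = rng.integers(n)  # same index for both runs (synchronous coupling)
        g = (X[i] @ w - y[i]) * X[i] + lam * w
        g2 = (X2[i] @ w2 - y2[i]) * X2[i] + lam * w2
        w = w - eta * g
        w2 = w2 - eta * g2
        gaps[k] = np.linalg.norm(w - w2)
    return gaps


if __name__ == "__main__":
    X, y, X2, y2 = make_neighbors()
    gaps = coupled_sgd_gaps(X, y, X2, y2)
    # Compare the largest gap early and late in the run.
    print("max gap, first half: ", gaps[: len(gaps) // 2].max())
    print("max gap, second half:", gaps[len(gaps) // 2 :].max())
```

With unit-norm features and these parameter values, each SGD update map is a contraction with factor at most 1 - eta * lam, so both iterates stay in a ball of radius max|y| / lam and the gap remains bounded however many steps are run.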
