Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and the generalization behavior of SGD. To address this empirical phenomenon theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more consistent with the reported empirical observations. While these bounds do not require additional topological assumptions, given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they apply only to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, thanks to the generality of the loss functions, they shed more light on the empirical observations.