Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and the generalization behavior of SGD. To address this empirical phenomenon theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions, given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they apply only to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light on the empirical observations, thanks to the generality of the loss functions.