Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares

Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails is linked to the generalization error. While these studies have shed light on interesting aspects of generalization behavior in modern settings, they rely on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared loss $x\mapsto x^2$, whereas it becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic experiments and with experiments on real neural networks.
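To make the modeling proxy concrete, the following is a minimal sketch of an Euler discretization of a heavy-tailed SDE on a one-dimensional least-squares problem. It is an illustration under stated assumptions, not the construction analyzed in the paper: the noise is assumed to be symmetric $\alpha$-stable and is drawn with the Chambers-Mallows-Stuck method, the noise increment is assumed to scale as $\eta^{1/\alpha}$, and the function names, the step size `eta`, and the noise scale `sigma` are all hypothetical choices.

```python
import numpy as np


def sample_symmetric_alpha_stable(alpha, size, rng):
    """Draw symmetric alpha-stable samples (Chambers-Mallows-Stuck method).

    alpha in (0, 2]; alpha = 2 gives a Gaussian with variance 2,
    and smaller alpha gives heavier tails.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-rate exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def heavy_tailed_euler_sgd(x, y, alpha, eta=0.01, sigma=0.1,
                           n_steps=1000, seed=0):
    """Euler scheme for a noisy gradient flow on 1-D least squares.

    Assumed update (illustrative only):
        theta <- theta - eta * grad + sigma * eta**(1/alpha) * S_k,
    where S_k is symmetric alpha-stable noise.
    """
    rng = np.random.default_rng(seed)
    noise = sample_symmetric_alpha_stable(alpha, n_steps, rng)
    theta = 0.0
    path = np.empty(n_steps)
    for k in range(n_steps):
        # Gradient of the empirical risk (1 / 2n) * sum_i (theta * x_i - y_i)^2.
        grad = np.mean(x * (theta * x - y))
        theta = theta - eta * grad + sigma * eta ** (1.0 / alpha) * noise[k]
        path[k] = theta
    return path


def surrogate_loss(theta, x, y, p):
    """Mean of |theta * x - y|^p; p = 2 recovers the squared loss."""
    return np.mean(np.abs(theta * x - y) ** p)
```

Running `heavy_tailed_euler_sgd` for several values of `alpha` and evaluating `surrogate_loss` with `p = 2` and with some `p < alpha` gives an informal feel for why the squared loss is sensitive to rare, very large noise increments while a lower-order surrogate is less so. Such a simulation only illustrates the setting and does not verify the stability bounds.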
