The study of the tail behaviour of SGD-induced processes has attracted significant interest, as it offers strong guarantees for individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of a finite-time nature, limiting their ability to capture the true long-term tail decay, which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the squared gradient norm of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{\beta_p}/\log(t)}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$, and $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight, up to poly-logarithmic factors. Notably, our results demonstrate an order-of-magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which shows rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
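As a quick numerical sanity check on the exponents stated above, the following sketch (function names are ours, purely illustrative) evaluates $\beta_p = \frac{4(p-1)}{3p-2}$ and compares the negative log-tail exponents of the new c-SGD rate, $t^{\beta_p}/\log(t)$, against the finite-time-based rate $t^{\beta_p/2}$, at a large horizon $t$:

```python
from math import log

def beta(p: float) -> float:
    """Exponent beta_p = 4(p-1)/(3p-2) from the c-SGD upper bound, p in (1, 2]."""
    return 4 * (p - 1) / (3 * p - 2)

def new_exponent(t: float, p: float) -> float:
    """Negative log of the new tail bound, ~ t^{beta_p} / log(t), for p in (1, 2)."""
    return t ** beta(p) / log(t)

def prior_exponent(t: float, p: float) -> float:
    """Negative log of the prior finite-time-based bound, ~ t^{beta_p / 2}."""
    return t ** (beta(p) / 2)

# Example: p = 1.5 gives beta_p = 4(0.5)/(2.5) = 0.8; at p = 2, beta_p = 1.
# At a realistic horizon t = 10^6 iterations, the new exponent dominates,
# i.e., the new bound decays much faster.
p, t = 1.5, 10**6
print(beta(p), beta(2.0))
print(new_exponent(t, p) > prior_exponent(t, p))
```

The comparison illustrates the abstract's claim: since $\beta_p > \beta_p/2$, the gap $t^{\beta_p}/\log(t)$ versus $t^{\beta_p/2}$ grows without bound as $t \to \infty$.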