Gradient clipping is a standard technique for training deep learning models, such as large-scale language models, that mitigates exploding gradients. Recent experimental studies have shown that, when training with gradient clipping, the smoothness of the training objective behaves unusually along the trajectory: it grows with the gradient norm. This is in clear contrast to the standard assumption in non-convex optimization, $L$-smoothness, under which the smoothness is bounded globally by a constant $L$. The recently introduced $(L_0,L_1)$-smoothness is a relaxed notion that captures this behavior. In particular, it has been shown that under this relaxed assumption, SGD with clipping requires $O(\epsilon^{-4})$ stochastic gradient computations to find an $\epsilon$-stationary solution. In this paper, we employ a variance reduction technique, namely SPIDER, and show that with a carefully designed learning rate this complexity improves to $O(\epsilon^{-3})$, which is order-optimal. The learning rate incorporates clipping to counteract the growing smoothness. Moreover, when the objective function is the average of $n$ components, we improve the existing $O(n\epsilon^{-2})$ bound on the stochastic gradient complexity to the order-optimal $O(\sqrt{n} \epsilon^{-2} + n)$.
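For reference, the two smoothness notions contrasted in the abstract are commonly stated as follows for a twice-differentiable objective $f$. This second-order form is the usual one in the relaxed-smoothness literature and is an assumption here; the paper itself may work with an equivalent first-order variant.

$$\|\nabla^2 f(x)\| \le L \quad (L\text{-smoothness}), \qquad \|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad ((L_0,L_1)\text{-smoothness}).$$

Setting $L_1 = 0$ recovers $L$-smoothness. As a one-dimensional illustration, $f(x) = x^4$ has $f''(x) = 12x^2$, which is unbounded, so $f$ is not $L$-smooth for any constant $L$; yet $12x^2 \le 12 + 3\,|4x^3|$ for all $x$, so $f$ is $(L_0,L_1)$-smooth with $L_0 = 12$ and $L_1 = 3$. A point $x$ is typically called $\epsilon$-stationary when $\|\nabla f(x)\| \le \epsilon$ (in expectation for stochastic methods).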
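The combination of a SPIDER-style variance-reduced gradient estimator with a clipped step can be sketched roughly as below, for the finite-sum case. This is a minimal illustration and not the paper's algorithm: the step rule `min(eta, gamma / ||v||)`, the function name `spider_clipped`, the toy quartic objective, and all parameter values (`eta`, `gamma`, `q`, `batch`) are assumptions chosen for the example.

```python
import numpy as np

def spider_clipped(grad, n, x0, eta=0.05, gamma=0.1, q=10, batch=8, iters=300, seed=0):
    """SPIDER-style estimator with a clipped step (illustrative sketch).

    grad(x, idx) must return the mean gradient of components idx at x.
    All hyperparameters are hypothetical defaults, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    x_prev = x.copy()
    v = np.zeros_like(x)
    for k in range(iters):
        if k % q == 0:
            # Periodic refresh: full gradient over all n components.
            v = grad(x, np.arange(n))
        else:
            # Recursive update: correct the previous estimate with a
            # minibatch gradient difference at the same sampled indices.
            idx = rng.integers(0, n, size=batch)
            v = v + grad(x, idx) - grad(x_prev, idx)
        norm = np.linalg.norm(v)
        # Clipped step size: the update length never exceeds gamma.
        step = min(eta, gamma / norm) if norm > 0 else eta
        x_prev = x.copy()
        x = x - step * v
    return x

# Toy finite sum: f_i(x) = (a_i . x - b_i)^4 / 4. Its curvature grows with
# the gradient norm, so it is not globally L-smooth.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
b = A @ np.ones(d)

def grad(x, idx):
    r = A[idx] @ x - b[idx]
    return A[idx].T @ r**3 / len(idx)

x0 = np.zeros(d)
x_final = spider_clipped(grad, n, x0)
g0 = np.linalg.norm(grad(x0, np.arange(n)))
g_final = np.linalg.norm(grad(x_final, np.arange(n)))
print(f"gradient norm: start {g0:.3g}, end {g_final:.3g}")
```

While the estimate `v` is large, the step length is capped at `gamma`, which keeps each move short in regions where the curvature is high; once `v` is small, the rule reduces to a plain step of size `eta`.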