Gradient clipping is a standard training technique used in deep learning applications such as large-scale language modeling to mitigate exploding gradients. Recent experimental studies have shown that, when a model is trained with gradient clipping, the smoothness of the training objective along the optimization trajectory behaves in a distinctive way: it grows with the gradient norm. This is in clear contrast to the standard assumption in non-convex optimization, namely $L$--smoothness, under which the smoothness is bounded globally by a constant $L$. The recently introduced $(L_0,L_1)$--smoothness is a more relaxed notion that captures this behavior. In particular, it has been shown that under this relaxed smoothness assumption, SGD with clipping requires $O(\epsilon^{-4})$ stochastic gradient computations to find an $\epsilon$--stationary solution. In this paper, we employ a variance reduction technique, namely SPIDER, and show that with a carefully designed learning rate this complexity improves to $O(\epsilon^{-3})$, which is order-optimal. The designed learning rate incorporates clipping to counteract the growing smoothness. Moreover, when the objective function is the average of $n$ components, we improve the existing $O(n\epsilon^{-2})$ bound on the stochastic gradient complexity to $O(\sqrt{n} \epsilon^{-2} + n)$, which is order-optimal as well. In addition to being theoretically optimal, SPIDER with our designed parameters achieves empirical performance comparable to that of variance-reduced methods such as SVRG and SARAH on several vision tasks.
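To make the combination of variance reduction and clipping concrete, the following is a minimal sketch, not the algorithm or parameter choice analyzed in the paper. For context, $(L_0,L_1)$--smoothness is commonly stated for twice-differentiable $f$ as $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$. The sketch uses the usual SPIDER recursion $v_k = v_{k-1} + \nabla f_S(x_k) - \nabla f_S(x_{k-1})$ with a periodic full-gradient refresh, and an assumed clipped step size $\min(\eta, \gamma / \|v_k\|)$. The names `spider_clipped`, `grad_batch`, and `demo`, the defaults for `eta`, `gamma`, `q`, and `batch`, and the least-squares toy problem are all hypothetical choices made for illustration. The toy objective is globally smooth, so it serves only as a sanity check of the update rule.

```python
import numpy as np


def spider_clipped(grad_batch, x0, n, *, epochs=20, q=None, batch=None,
                   eta=0.1, gamma=0.5, seed=0):
    """Sketch of SPIDER with a clipped step for f(x) = (1/n) * sum_i f_i(x).

    grad_batch(x, idx) must return the average gradient of components idx at x.
    The step rule min(eta, gamma / ||v||) is an assumed form of clipping.
    """
    rng = np.random.default_rng(seed)
    q = q or max(1, int(np.sqrt(n)))          # refresh period (assumed ~ sqrt(n))
    batch = batch or max(1, int(np.sqrt(n)))  # mini-batch size (assumed ~ sqrt(n))
    x = np.asarray(x0, dtype=float).copy()
    x_prev, v = x.copy(), None
    all_idx = np.arange(n)
    for k in range(epochs * q):
        if k % q == 0:
            # Periodic refresh: recompute the full gradient.
            v = grad_batch(x, all_idx)
        else:
            # SPIDER recursion: correct the previous estimate with a
            # gradient difference evaluated on one shared mini-batch.
            idx = rng.choice(n, size=batch, replace=False)
            v = v + grad_batch(x, idx) - grad_batch(x_prev, idx)
        # Clipped step size: the move ||x_next - x|| never exceeds gamma.
        step = min(eta, gamma / (np.linalg.norm(v) + 1e-12))
        x_prev = x
        x = x - step * v
    return x


def demo(n=64, d=5, seed=0):
    """Run the sketch on a synthetic least-squares problem (illustrative only)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def grad_batch(x, idx):
        # Average gradient of 0.5 * (a_i^T x - b_i)^2 over the rows in idx.
        r = A[idx] @ x - b[idx]
        return A[idx].T @ r / len(idx)

    x0 = np.zeros(d)
    full = np.arange(n)
    g0 = np.linalg.norm(grad_batch(x0, full))
    x = spider_clipped(grad_batch, x0, n)
    g1 = np.linalg.norm(grad_batch(x, full))
    return g0, g1  # full-gradient norm before and after
```

Calling `demo()` returns the full-gradient norm at the starting point and at the final iterate, which gives a quick check that the clipped SPIDER updates are moving toward a stationary point. Choosing `q` and `batch` near $\sqrt{n}$ mirrors the usual finite-sum SPIDER setup, but the values that achieve the stated complexity bounds depend on the paper's analysis and are not reproduced here.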