Gradient clipping is a popular modification to standard (stochastic) gradient descent that, at every iteration, limits the gradient norm to a certain value $c > 0$. It is widely used, for example, to stabilize the training of deep learning models (Goodfellow et al., 2016) or to enforce differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show the precise dependence on arbitrary clipping thresholds $c$, and we show that these guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold affects only the higher-order terms of convergence, and (ii) in the stochastic setting, convergence to the true optimum cannot be guaranteed under the standard noise assumption, even with arbitrarily small step-sizes. We give matching upper and lower bounds for the convergence of the gradient norm when running clipped SGD, and we illustrate these results with experiments.
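To make the clipping mechanism concrete, the following is a minimal sketch of deterministic clipped gradient descent. It assumes the common norm-clipping operator $\mathrm{clip}_c(g) = \min(1, c/\|g\|)\, g$; the function names `clip` and `clipped_gd`, the toy quadratic objective, and the chosen step-size and threshold are illustrative assumptions and are not taken from the paper.

```python
import math


def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c (assumes c > 0)."""
    norm = math.sqrt(sum(x * x for x in g))
    # Leave g unchanged when its norm is already within the threshold.
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [scale * x for x in g]


def clipped_gd(grad, x0, step, c, iters):
    """Run gradient descent where each gradient is clipped to norm c."""
    x = list(x0)
    for _ in range(iters):
        g = clip(grad(x), c)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = clipped_gd(lambda x: x, x0=[3.0, 4.0], step=0.1, c=1.0, iters=200)
```

In this toy run the iterates first move with steps of fixed length `step * c` while the gradient norm exceeds `c`, then behave like ordinary gradient descent once the gradient falls below the threshold.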