Gradient clipping is a popular modification to standard (stochastic) gradient descent that, at every iteration, limits the gradient norm to a certain value $c > 0$. It is widely used, for example, to stabilize the training of deep learning models (Goodfellow et al., 2016) or to enforce differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show the precise dependence on arbitrary clipping thresholds $c$, and we show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, and (ii) in the stochastic setting, convergence to the true optimum cannot be guaranteed under the standard noise assumption, even with arbitrarily small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments.
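As an illustration of the clipping mechanism described above, the following is a minimal sketch of deterministic clipped gradient descent. It assumes the common rescaling form of clipping, in which a gradient $g$ with $\|g\| > c$ is replaced by $c \, g / \|g\|$ and left unchanged otherwise. The function names `clip` and `clipped_gd`, the quadratic test objective, and all parameter values are hypothetical choices made for this example, not taken from the paper.

```python
import math


def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c (assumes c > 0)."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= c:
        return list(g)  # already within the threshold, leave unchanged
    scale = c / norm
    return [scale * x for x in g]


def clipped_gd(grad, x0, step, c, iters):
    """Run `iters` steps of gradient descent using the clipped gradient."""
    x = list(x0)
    for _ in range(iters):
        g = clip(grad(x), c)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Illustrative objective (an assumption): f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = clipped_gd(lambda x: list(x), [10.0, -10.0], step=0.5, c=1.0, iters=100)
final_norm = math.sqrt(sum(v * v for v in x_final))
```

In this toy run, clipping is active only while the iterate is far from the optimum, where each step has length `step * c`. Once the gradient norm drops below `c`, the update coincides with plain gradient descent.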