The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit we derive a deterministic equation that describes the evolution of the loss. We show that under Gaussian noise clipping cannot improve SGD performance. In other noise settings, however, clipping can provide benefits when the clipping threshold is properly tuned. In these cases, clipping biases the updates in a way beneficial to training that cannot be recovered by SGD under any learning-rate schedule. We conclude with a discussion of the links between high-dimensional clipping and neural network training.
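For concreteness, the following is a minimal sketch of the setting described above: norm-based gradient clipping applied to streaming SGD on a least-squares problem with Gaussian label noise. It is illustrative only, not the paper's implementation; all names and parameter values (d, eta, clip_threshold, noise_std, n_steps) are assumptions chosen for the example.

```python
# Illustrative sketch: clipped streaming SGD on least squares with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)

d = 100                 # ambient dimension (illustrative)
eta = 0.05              # learning rate
clip_threshold = 1.0    # clipping threshold
noise_std = 0.5         # Gaussian label-noise scale
n_steps = 10_000

w_star = rng.normal(size=d) / np.sqrt(d)   # target weights
w = np.zeros(d)                            # current iterate

for t in range(n_steps):
    # Streaming setting: a fresh sample (x_t, y_t) at every step.
    x = rng.normal(size=d)
    y = x @ w_star + noise_std * rng.normal()

    # Per-sample least-squares gradient: (w . x - y) x
    g = (w @ x - y) * x

    # Norm-based clipping: rescale the gradient if its norm exceeds the threshold.
    g_norm = np.linalg.norm(g)
    if g_norm > clip_threshold:
        g *= clip_threshold / g_norm

    w -= eta * g

# With isotropic Gaussian inputs, the population loss is 0.5 * ||w - w_star||^2.
print(f"final population loss: {0.5 * np.sum((w - w_star) ** 2):.4f}")
```

Varying `clip_threshold` (or replacing the Gaussian label noise with a heavier-tailed distribution) in this sketch is one way to probe the regimes where clipping does or does not help, in the spirit of the comparison summarized above.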