We present and analyze a novel regularized form of the gradient clipping algorithm and prove that it converges to global minima of the loss surface of deep neural networks under the squared loss, provided that the layers are sufficiently wide. The algorithm, dubbed $\delta$-GClip, modifies gradient clipping to yield a first-of-its-kind example of a step-size schedule for gradient descent that provably minimizes the training loss of deep neural nets. We also present empirical evidence that our theoretically founded $\delta$-GClip algorithm is competitive with state-of-the-art deep learning heuristics on various neural architectures, including modern transformer-based architectures. Our modification to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality that was recently proven to hold for sufficiently wide neural networks of any depth within a neighbourhood of the initialization.
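To make the idea of a regularized clipping step size concrete, the following is a minimal sketch rather than the algorithm as specified in the paper. The abstract does not state the update rule, so the sketch assumes that the regularization is a floor $\delta$ on the usual clipping factor, giving a step size $\eta \cdot \min\{1, \max\{\delta, \gamma / \|\nabla L(w)\|\}\}$. The function names and the parameters `eta`, `gamma`, and `delta` are hypothetical, and the quadratic loss in the usage note is a toy stand-in for a neural network loss. For reference, the PL* condition is commonly stated as $\|\nabla L(w)\|^2 \ge \mu L(w)$ for some $\mu > 0$ on the relevant set; under that condition a step size bounded away from zero is what one would expect to drive the loss down at a controlled rate.

```python
import numpy as np


def delta_gclip_step_size(grad, eta=0.1, gamma=1.0, delta=0.01):
    """Assumed step size: eta * min(1, max(delta, gamma / ||grad||)).

    With delta = 0 this reduces to ordinary norm-based gradient clipping;
    delta > 0 keeps the step size at or above eta * delta.
    """
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        # gamma / 0 is treated as +inf, so the min selects 1.
        return eta
    return eta * min(1.0, max(delta, gamma / norm))


def delta_gclip_descent(grad_fn, w0, steps=200, **step_kwargs):
    """Run gradient descent using the assumed step-size rule above."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = grad_fn(w)
        w = w - delta_gclip_step_size(g, **step_kwargs) * g
    return w


def quadratic_grad(w):
    """Gradient of the toy loss L(w) = 0.5 * ||w||^2 (illustration only)."""
    return w
```

As a usage example, `delta_gclip_descent(quadratic_grad, [3.0, -4.0])` first takes clipped steps of fixed length `eta * gamma` while the gradient norm exceeds `gamma`, then switches to plain gradient descent with step size `eta` once the gradient is small. The floor `delta` only becomes active when the gradient norm exceeds `gamma / delta`.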