This work introduces a hybrid non-Euclidean optimization method that generalizes gradient norm clipping by combining steepest descent and conditional gradient approaches. The method achieves the best of both worlds, as we establish a descent property under a generalized notion of $(L_0, L_1)$-smoothness. Weight decay is incorporated in a principled manner through a connection to the Frank-Wolfe short step. In the stochastic case, we show an order-optimal $O(n^{-1/4})$ convergence rate by leveraging a momentum-based gradient estimator. We discuss how to instantiate the algorithms for deep learning, dubbing them Clipped Scion, and demonstrate their properties on image classification and language modeling. The code is available at https://github.com/LIONS-EPFL/ClippedScion.