This paper proposes DoWG (Distance over Weighted Gradients), a new, easy-to-implement, parameter-free gradient-based optimizer. We prove that DoWG is efficient: without tuning any parameters, it matches the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor. We also prove that it is universal: it automatically adapts to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients to use for normalization, DoWG maintains a new distance-based weighted version of the running average, which is crucial for achieving the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and we validate its effectiveness on practical machine learning tasks.
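To make the phrase "distance-based weighted version of the running average" concrete, the following is a minimal sketch of one way the update can be written. The abstract does not state the recurrences, so the forms used here are our reading of the scheme and should be treated as an assumption: a running distance estimate r_t = max(||x_t - x_0||, r_{t-1}), a weighted gradient sum v_t = v_{t-1} + r_t^2 ||g_t||^2, and a step size r_t^2 / sqrt(v_t). The function name `dowg`, the initial distance estimate `r_eps`, and the quadratic test problem are illustrative choices. The weighted iterate averaging that convergence analyses of such methods typically rely on is omitted for brevity.

```python
import numpy as np


def dowg(grad, x0, steps=500, r_eps=1e-4):
    """Sketch of a DoWG-style update (assumed form, see the text above).

    grad  : callable returning the gradient at a point
    x0    : starting point (1-D array-like)
    r_eps : small initial distance estimate (hypothetical default)
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    r = float(r_eps)  # running estimate of the distance travelled from x0
    v = 0.0           # distance-weighted sum of squared gradient norms

    for _ in range(steps):
        g = np.asarray(grad(x), dtype=float)
        # The distance estimate never decreases.
        r = max(r, float(np.linalg.norm(x - x0)))
        # AdaGrad-style accumulator, but each term is weighted by r^2.
        v += r**2 * float(np.dot(g, g))
        if v == 0.0:
            break  # every gradient so far was zero: already stationary
        eta = r**2 / np.sqrt(v)  # no learning rate to tune
        x = x - eta * g
    return x


# Usage on a toy convex problem: f(x) = 0.5 * ||x - b||^2, gradient x - b.
b = np.array([0.6, 0.8])
x_final = dowg(lambda z: z - b, np.zeros(2), steps=500, r_eps=0.1)
print(np.linalg.norm(x_final - b))
```

The only difference from an AdaGrad-norm step in this sketch is the r^2 weighting: it appears both in the accumulator and in the numerator of the step size, so the step grows as the iterates move away from the starting point rather than being fixed by a user-supplied scale.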