This paper proposes a new, easy-to-implement, parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is both efficient and universal: without tuning any parameters, it matches the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor, and it adapts automatically to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients to use for normalization, DoWG maintains a new distance-based weighted version of the running average, which is crucial to achieving the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and we validate its effectiveness on practical machine learning tasks.
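To make the distance-based weighting concrete, the following is a minimal NumPy sketch of one plausible form of the update. The abstract does not state the update rule, so everything in it is an assumption: the distance estimate `r_bar` (the largest distance from the starting point seen so far), the weighted sum `v` of squared gradient norms, the step size `r_bar**2 / sqrt(v)`, and the small initial distance `r_eps`. The function name `dowg` and its parameters are hypothetical, and the toy quadratic objective is only for illustration.

```python
import numpy as np


def dowg(grad, x0, steps=500, r_eps=None):
    """Sketch of a DoWG-style loop (assumed form, not a reference implementation).

    grad:  callable returning the gradient at a point
    x0:    starting point
    r_eps: small initial distance estimate (assumed default below)
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    if r_eps is None:
        # Assumed heuristic: a tiny fraction of the scale of x0.
        r_eps = 1e-4 * (1.0 + np.linalg.norm(x0))
    r_bar = r_eps  # running estimate of the distance travelled from x0
    v = 0.0        # distance-weighted running sum of squared gradient norms
    for _ in range(steps):
        g = grad(x)
        # The distance estimate never decreases.
        r_bar = max(r_bar, float(np.linalg.norm(x - x0)))
        # Each squared gradient norm is weighted by the current squared distance.
        v += r_bar**2 * float(np.dot(g, g))
        if v > 0.0:  # skip the step if every gradient so far was exactly zero
            x = x - (r_bar**2 / np.sqrt(v)) * g
    return x


# Toy usage: minimize 0.5 * ||x - b||^2, whose gradient is x - b.
b = np.array([3.0, -4.0])
x_final = dowg(lambda x: x - b, x0=np.zeros(2))
print(x_final)
```

No learning rate is passed in: the step size is derived entirely from the observed distances and gradients. The sketch returns the last iterate for simplicity; convergence guarantees of this kind are often stated for an averaged iterate instead.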