This paper proposes a new easy-to-implement, parameter-free, gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient: without tuning any parameters, it matches the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor. We also prove that it is universal: it automatically adapts to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients to use for normalization, DoWG maintains a new distance-based weighted version of the running average, which is crucial for achieving the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and validate its effectiveness on practical machine learning tasks.
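The contrast with AdaGrad-style normalization can be made concrete with a short sketch. The abstract does not state the update rule, so the form used here is an assumption: a running distance estimate `r` (the largest distance from the starting point seen so far), a running sum `v` of squared gradient norms weighted by `r**2`, and a step size `r**2 / sqrt(v)`. The function name `dowg`, the initial distance estimate `r_eps`, and the toy quadratic objective are hypothetical and chosen only for illustration.

```python
import math

def dowg(grad, x0, steps=500, r_eps=1e-4):
    """Sketch of a DoWG-style loop on plain Python lists.

    Assumed update (not stated in the abstract):
        r_t   = max(r_{t-1}, ||x_t - x_0||)   with r_{-1} = r_eps
        v_t   = v_{t-1} + r_t^2 * ||g_t||^2
        eta_t = r_t^2 / sqrt(v_t)
        x_{t+1} = x_t - eta_t * g_t
    """
    x = list(x0)
    r = r_eps   # running estimate of the distance travelled from x0
    v = 0.0     # distance-weighted sum of squared gradient norms
    for _ in range(steps):
        g = grad(x)
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        r = max(r, dist)
        v += r * r * sum(gi * gi for gi in g)
        if v == 0.0:
            break  # every gradient so far was zero: x0 is already stationary
        eta = r * r / math.sqrt(v)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Toy problem (hypothetical): minimize 0.5 * ||x - target||^2, whose gradient is x - target.
target = [3.0, -2.0]
x_final = dowg(lambda x: [xi - ti for xi, ti in zip(x, target)], [0.0, 0.0])
```

An AdaGrad-norm style accumulator would instead add the unweighted `sum(gi * gi for gi in g)` to `v` and would need a separately chosen numerator for the step size. In this sketch the only user-supplied quantity besides the starting point is the small `r_eps`.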