This paper proposes a new, easy-to-implement, parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient, in that it matches the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor without tuning any parameters, and universal, in that it automatically adapts to both smooth and nonsmooth problems. While popular algorithms such as AdaGrad, Adam, or DoG compute a running average of the squared gradients, DoWG maintains a new distance-based weighted version of the running average, which is crucial for achieving these properties. To the best of our knowledge, DoWG is the first parameter-free, efficient, and universal algorithm that does not require backtracking search procedures. It is also the first parameter-free AdaGrad-style algorithm that adapts to smooth optimization. To complement our theory, we show empirically that DoWG trains at the edge of stability, and we validate its effectiveness on practical machine learning tasks. This paper further uncovers the underlying principle behind the success of the AdaGrad family of algorithms by presenting a novel analysis of Normalized Gradient Descent (NGD), which shows that NGD adapts to smoothness when it exists, with no change to the stepsize. This establishes the universality of NGD and partially explains the empirical observation that it trains at the edge of stability in a much more general setup than standard gradient descent. This analysis of NGD might be of independent interest to the community.
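The abstract describes a distance-based weighted running quantity but does not spell out the update rule. As a rough illustration only, the sketch below assumes an update of the form r̄_t = max(‖x_t − x_0‖, r̄_{t−1}), v_t = v_{t−1} + r̄_t² ‖g_t‖², η_t = r̄_t² / √v_t, x_{t+1} = x_t − η_t g_t. This form is an assumption about how such an optimizer could look and is not taken from the text above. The names `dowg`, `quadratic_grad`, `steps`, and `r_eps` (a small initial distance estimate) are hypothetical.

```python
import math


def dowg(grad, x0, steps=1000, r_eps=1e-4):
    """Sketch of a DoWG-style loop under the ASSUMED update rule in the lead-in.

    grad  : callable mapping a list of floats to its gradient (list of floats)
    x0    : starting point (list of floats)
    r_eps : hypothetical small initial distance estimate (not a tuned stepsize)
    """
    x = list(x0)
    r_bar = r_eps  # running maximum of the distance travelled from x0
    v = 0.0        # running sum of squared gradient norms, weighted by r_bar**2
    for _ in range(steps):
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:
            break  # exact stationary point: nothing left to do
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        r_bar = max(r_bar, dist)
        v += r_bar ** 2 * g_sq            # distance-weighted accumulation
        eta = r_bar ** 2 / math.sqrt(v)   # stepsize: distance over weighted gradients
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


def quadratic_grad(target):
    """Gradient of the toy objective f(x) = 0.5 * ||x - target||^2."""
    return lambda x: [xi - ti for xi, ti in zip(x, target)]
```

On a toy problem one might call `dowg(quadratic_grad([3.0, -2.0]), [0.0, 0.0])`. No learning rate is passed in: the stepsize is derived from the distance travelled and the accumulated gradients, which is the sense in which the scheme is parameter-free.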