We propose a tuning-free dynamic SGD step size formula, which we call Distance over Gradients (DoG). The DoG step sizes depend only on simple empirical quantities (the distance from the initial point and the norms of the gradients) and have no ``learning rate'' parameter. Theoretically, we show that a slight variation of the DoG formula enjoys strong parameter-free convergence guarantees for stochastic convex optimization, assuming only \emph{locally bounded} stochastic gradients. Empirically, we consider a broad range of vision and language transfer learning tasks and show that DoG's performance is close to that of SGD with a tuned learning rate. We also propose a per-layer variant of DoG that generally outperforms tuned SGD, approaching the performance of tuned Adam. A PyTorch implementation is available at https://github.com/formll/dog.
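
To make the step size rule concrete, the following is a minimal sketch of a DoG-style update in plain Python. It assumes the step size at each iteration is the largest distance from the initial point seen so far, divided by the square root of the running sum of squared gradient norms. The function name `dog_sgd`, the seed parameter `r_eps` (a small initial "distance" so that the first step is nonzero), and the toy quadratic objective are illustrative assumptions, not taken from the text above; the official implementation at the linked repository may differ in its details.

```python
import math


def dog_sgd(grad, x0, steps=500, r_eps=1e-6):
    """SGD with a DoG-style step size (illustrative sketch, not the official code).

    step size = (max distance from x0 so far) / sqrt(sum of squared gradient norms)
    """
    x = list(x0)
    # Assumption: seed the distance term with a tiny value scaled by ||x0||,
    # otherwise the very first step size would be zero.
    max_dist = r_eps * (1.0 + math.sqrt(sum(v * v for v in x0)))
    grad_sq_sum = 0.0

    for _ in range(steps):
        g = grad(x)
        grad_sq_sum += sum(v * v for v in g)
        if grad_sq_sum == 0.0:
            break  # every gradient so far is zero: nothing to do

        eta = max_dist / math.sqrt(grad_sq_sum)  # no learning-rate parameter
        x = [xi - eta * gi for xi, gi in zip(x, g)]

        # Track the largest distance from the initial point.
        dist = math.sqrt(sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0)))
        max_dist = max(max_dist, dist)

    return x


if __name__ == "__main__":
    # Hypothetical toy problem: f(x) = 0.5 * ||x - (3, -2)||^2, exact gradients.
    def toy_grad(p):
        return [p[0] - 3.0, p[1] + 2.0]

    print(dog_sgd(toy_grad, [0.0, 0.0]))
```

In this sketch the step size starts tiny and grows as the iterates move away from the initial point, so no learning rate has to be chosen by hand. The toy problem uses exact gradients; in the stochastic setting `grad` would return a stochastic gradient instead.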