This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. The method decreases the learning rate, for better adaptation to the local geometry of the objective function, whenever a stationary phase is detected, that is, when the iterates are likely to be bouncing around in the vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and incurs essentially no additional computational cost compared with standard SGD. Through a series of extensive experiments, we show that this method is appropriate both for convex problems and for training (non-convex) neural networks, with performance that compares favorably with other stochastic optimization methods. Importantly, the method is observed to be very robust with a single set of default parameters across a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam.
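The splitting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy quadratic objective, the window length `steps`, the decay factor, the rule of declaring stationarity on a single negative inner product, and the merging of the two threads by averaging are all assumptions made for illustration, and the helper names (`noisy_grad`, `run_thread`, `looks_stationary`, `split_sgd_sketch`) are hypothetical.

```python
import numpy as np


def noisy_grad(theta, rng, noise=1.0):
    # Stochastic gradient of the toy objective f(theta) = 0.5 * ||theta||^2
    # (an assumed example problem, not one taken from the paper).
    return theta + noise * rng.standard_normal(theta.shape)


def run_thread(theta, lr, steps, rng):
    # Run plain SGD for `steps` iterations from a copy of `theta`.
    # Returns the final iterate and the gradient averaged over the window.
    theta = theta.copy()
    grad_sum = np.zeros_like(theta)
    for _ in range(steps):
        g = noisy_grad(theta, rng)
        theta -= lr * g
        grad_sum += g
    return theta, grad_sum / steps


def looks_stationary(g1, g2):
    # Assumed decision rule: a negative inner product between the averaged
    # gradients of the two threads suggests that noise dominates the drift.
    return bool(np.dot(g1, g2) < 0)


def split_sgd_sketch(theta0, lr=0.5, decay=0.5, steps=20, rounds=40, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lrs = []
    for _ in range(rounds):
        # Single-thread phase: ordinary SGD.
        theta, _ = run_thread(theta, lr, steps, rng)
        # Splitting phase: two threads start from the same iterate but
        # see independent gradient noise.
        t1, g1 = run_thread(theta, lr, steps, rng)
        t2, g2 = run_thread(theta, lr, steps, rng)
        if looks_stationary(g1, g2):
            lr *= decay  # stationary phase detected: shrink the learning rate
        theta = 0.5 * (t1 + t2)  # merge the threads (assumed: simple average)
        lrs.append(lr)
    return theta, lr, lrs


if __name__ == "__main__":
    theta, lr, lrs = split_sgd_sketch(5.0 * np.ones(10))
    print("final learning rate:", lr)
    print("distance to minimum:", np.linalg.norm(theta))
```

The intuition behind the diagnostic is that far from a minimum both threads follow the same descent direction, so their averaged gradients align and the inner product is positive. Near a minimum the gradients are dominated by independent noise, so the inner product is negative roughly as often as it is positive. A single-window test such as this one is noisy, so a practical variant would likely aggregate several inner products before deciding.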