Integrating adaptive learning rates and momentum techniques into SGD yields a large class of efficiently accelerated adaptive stochastic algorithms, such as AdaGrad, RMSProp, Adam, AccAdaGrad, \textit{etc}. Despite their effectiveness in practice, a large gap remains in their convergence theory, especially in the difficult non-convex stochastic setting. To fill this gap, we propose \emph{weighted AdaGrad with unified momentum}, dubbed AdaUSM, whose main characteristics are: (1) it incorporates a unified momentum scheme that covers both heavy ball momentum and Nesterov accelerated gradient momentum; (2) it adopts a novel weighted adaptive learning rate that unifies the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp. Moreover, when AdaUSM uses polynomially growing weights, we establish an $\mathcal{O}(\log(T)/\sqrt{T})$ convergence rate in the non-convex stochastic setting. We also show that the adaptive learning rates of Adam and RMSProp correspond to taking exponentially growing weights in AdaUSM, thereby providing a new perspective for understanding Adam and RMSProp. Lastly, we compare AdaUSM against SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad on various deep learning models and datasets.
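The correspondence between weight choices and familiar adaptive learning rates can be illustrated numerically. The sketch below is not the paper's algorithm: the abstract does not state the update formulas, so it assumes the weighted second-moment estimate takes the normalized form $\sum_{i \le t} a_i g_i^2 / \sum_{i \le t} a_i$, uses scalar gradients for simplicity, and introduces the hypothetical helper names `weighted_second_moment` and `adam_second_moment`. Under that assumption, uniform weights recover the AdaGrad accumulator up to a factor of $t$, and exponentially growing weights $a_i = \beta^{-i}$ recover the bias-corrected exponential moving average used in Adam.

```python
import numpy as np


def weighted_second_moment(grads, weights):
    """Prefix-wise weighted average of squared gradients.

    Assumed form (not taken from the source):
        v_t = sum_{i<=t} a_i * g_i**2 / sum_{i<=t} a_i
    """
    grads = np.asarray(grads, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.cumsum(weights * grads**2) / np.cumsum(weights)


def adam_second_moment(grads, beta):
    """Bias-corrected exponential moving average of squared gradients."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta * v + (1.0 - beta) * g**2
        out.append(v / (1.0 - beta**t))
    return np.array(out)


rng = np.random.default_rng(0)
T, beta = 50, 0.9
g = rng.normal(size=T)               # synthetic scalar "gradients"
t = np.arange(1, T + 1, dtype=float)

# Uniform weights: t * v_t equals the AdaGrad accumulator sum_{i<=t} g_i**2.
v_uniform = weighted_second_moment(g, np.ones(T))
assert np.allclose(t * v_uniform, np.cumsum(g**2))

# Exponentially growing weights a_i = beta**(-i): v_t equals Adam's
# bias-corrected second-moment estimate.
v_exp = weighted_second_moment(g, beta ** (-t))
assert np.allclose(v_exp, adam_second_moment(g, beta))

# Polynomially growing weights a_i = i**alpha sit between the two regimes;
# alpha = 2 is an arbitrary illustrative choice.
v_poly = weighted_second_moment(g, t**2.0)
print(v_uniform[-1], v_poly[-1], v_exp[-1])
```

The exponential case holds exactly because multiplying numerator and denominator by $\beta^{t}$ turns $\sum_i \beta^{-i} g_i^2 / \sum_i \beta^{-i}$ into $\sum_i \beta^{t-i} g_i^2 / \sum_i \beta^{t-i}$, which is the bias-corrected moving average. The exact normalization used in AdaUSM may differ from the form assumed here.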