Momentum has become a crucial component of deep learning optimizers, which calls for a thorough understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of ``when'', we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the \emph{effective learning rate} $\eta_{ef}$, a notion that unifies the influence of the momentum coefficient $\mu$ and the batch size $b$ on the learning rate $\eta$. Comparing SGDM and SGD at the same effective learning rate and the same batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD reach almost the same empirical training loss; once $\eta_{ef}$ surpasses a certain threshold, SGDM begins to outperform SGD. Furthermore, the advantage of SGDM over SGD becomes more pronounced as the batch size grows. For the question of ``why'', we find that momentum acceleration is closely related to \emph{abrupt sharpening}, which describes a sudden jump of the directional Hessian along the update direction. Specifically, the performance gap between SGD and SGDM opens at the same moment that SGD experiences abrupt sharpening and its convergence slows. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Taken together, these findings reveal the interplay between momentum, learning rate, and batch size, and thereby improve our understanding of momentum acceleration.
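As a rough illustration of the comparison framework, the sketch below matches a heavy-ball momentum run to a momentum-free run at a fixed batch size. It assumes the common convention $\eta_{ef} = \eta / (1 - \mu)$; the exact definition of $\eta_{ef}$, including how the batch size $b$ enters, is not stated in this passage. The formula, the function names, and the noise-free toy quadratic are all assumptions made for illustration.

```python
# Sketch only: assumes eta_ef = eta / (1 - mu) at a fixed batch size.
# The abstract does not state the exact formula; all names are hypothetical.

def matched_lr(eta_ef, mu):
    """Raw learning rate eta whose assumed effective rate equals eta_ef."""
    return eta_ef * (1.0 - mu)


def heavy_ball(grad, x0, eta, mu, steps):
    """Heavy-ball update v <- mu*v + g, x <- x - eta*v; mu = 0 is plain GD."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = mu * v + grad(x)
        x = x - eta * v
    return x


def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * x**2 (no gradient noise)."""
    return x


eta_ef = 0.01  # a small effective learning rate
x_sgd = heavy_ball(grad, 1.0, matched_lr(eta_ef, 0.0), 0.0, 2000)
x_sgdm = heavy_ball(grad, 1.0, matched_lr(eta_ef, 0.9), 0.9, 2000)
print(abs(x_sgd), abs(x_sgdm))  # both end very close to the minimizer 0
```

In this noise-free, low-curvature setting the two runs behave alike, which is in the spirit of the small-$\eta_{ef}$ regime described above. The toy problem does not reproduce the threshold behaviour or the batch-size effect.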
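The directional Hessian along an update direction $u$ can be sketched as the Rayleigh quotient $u^\top H u / u^\top u$. This normalization, the toy Hessian, and the function name below are assumptions, since the passage does not give the exact definition.

```python
# Sketch only: directional Hessian taken as the Rayleigh quotient
# u^T H u / u^T u. This normalization is an assumption, not a stated
# definition; the matrix H and the directions are toy values.

def directional_hessian(H, u):
    """Curvature of a quadratic with Hessian H along direction u."""
    Hu = [sum(h_ij * u_j for h_ij, u_j in zip(row, u)) for row in H]
    uHu = sum(u_i * hu_i for u_i, hu_i in zip(u, Hu))
    uu = sum(u_i * u_i for u_i in u)
    return uHu / uu


H = [[100.0, 0.0],
     [0.0, 1.0]]  # one sharp direction, one flat direction

flat = directional_hessian(H, [0.0, 1.0])   # update along the flat axis
sharp = directional_hessian(H, [1.0, 0.0])  # update along the sharp axis
mixed = directional_hessian(H, [1.0, 1.0])  # equal mix of both axes
print(flat, sharp, mixed)
```

Under this reading, an update direction that suddenly rotates toward the sharp axis would show up as a jump in the measured value, which is one possible way to picture abrupt sharpening.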