Curvature information -- in particular, the largest eigenvalue of the loss Hessian, known as the sharpness -- often forms the basis for learning rate tuners. However, recent work has shown that curvature information undergoes complex dynamics during training, going from a phase of increasing sharpness to eventual stabilization. We analyze the closed-loop feedback effect between learning rate tuning and curvature. We find that classical learning rate tuners may yield greater one-step loss reduction, yet ultimately underperform constant learning rates in the long term in the full batch regime. These tuners break the stabilization of the sharpness, which we explain using a simplified model of the joint dynamics of the learning rate and the curvature. To further investigate these effects, we introduce a new learning rate tuning method, Curvature Dynamics Aware Tuning (CDAT), which prioritizes long-term curvature stabilization over instantaneous progress on the objective. In the full batch regime, CDAT shows behavior akin to prefixed warm-up schedules on deep learning objectives, outperforming tuned constant learning rates. In the mini batch regime, we observe that stochasticity introduces confounding effects that explain the previous success of some learning rate tuners at appropriate batch sizes. Our findings highlight the critical role of understanding the joint dynamics of the learning rate and curvature, beyond greedy minimization, to diagnose failures and design effective adaptive learning rate tuners.
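The sharpness referenced above can be estimated without materializing the full Hessian, using power iteration on Hessian-vector products. The following is a minimal illustrative sketch (not the paper's method): it uses a toy quadratic loss whose Hessian is a fixed matrix `A`, so the exact largest eigenvalue is known and the estimate can be checked against it.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w, so the Hessian is the constant
# matrix A and the sharpness is its largest eigenvalue. Hypothetical example
# for illustration; real settings would use autodiff HVPs instead.
rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])  # symmetric Hessian; top eigenvalue (5 + sqrt(5)) / 2

def hvp(w, v):
    """Hessian-vector product of the toy loss at w (constant in w here)."""
    return A @ v

def sharpness(w, num_iters=100):
    """Estimate the largest Hessian eigenvalue by power iteration on HVPs."""
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        hv = hvp(w, v)
        v = hv / np.linalg.norm(hv)
    return v @ hvp(w, v)  # Rayleigh quotient at the converged direction

w = np.zeros(2)
lam = sharpness(w)  # ≈ (5 + sqrt(5)) / 2 ≈ 3.618
```

For gradient descent on a quadratic, the sharpness sets the stability threshold: the iteration is stable only when the learning rate stays below 2 / lam, which is why curvature estimates of this kind feed into learning rate tuners.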