Optimization in deep learning remains poorly understood, even in the simple setting of deterministic (i.e. full-batch) training. A key difficulty is that much of an optimizer's behavior is implicitly determined by complex oscillatory dynamics, referred to as the "edge of stability." The main contribution of this paper is to show that an optimizer's implicit behavior can be explicitly captured by a "central flow": a differential equation which models the time-averaged optimization trajectory. We show empirically that these flows can predict the long-term optimization trajectories of generic neural networks with a high degree of numerical accuracy. By interpreting these flows, we reveal for the first time (1) the precise sense in which RMSProp adapts to the local loss landscape, and (2) an "acceleration via regularization" mechanism, wherein adaptive optimizers implicitly navigate towards low-curvature regions in which they can take larger steps. This mechanism is key to the efficacy of these adaptive optimizers. Overall, we believe that central flows constitute a promising tool for reasoning about optimization in deep learning.
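To build intuition for "a differential equation which models the time-averaged optimization trajectory," the following is a minimal toy sketch, not the paper's central flow construction: gradient descent on a fixed quadratic loss with one sharp and one flat direction. When the step size times the curvature is close to 2, the iterates oscillate along the sharp direction, yet the midpoint-averaged trajectory is smooth, and along the flat direction the iterates closely track the continuous gradient flow dw/dt = -∇L(w). All constants (curvatures, step size, horizon) are illustrative choices.

```python
import numpy as np

# Quadratic loss L(w) = 0.5 * sum(h * w**2) with one "sharp" and one
# "flat" direction. eta * h = [1.9, 0.01]: the sharp direction sits
# near the stability threshold eta * curvature = 2, so it oscillates.
h = np.array([190.0, 1.0])   # curvatures (illustrative values)
eta = 0.01                   # step size

def grad(w):
    return h * w             # gradient of the quadratic loss

w = np.array([1.0, 1.0])
traj = [w.copy()]
for _ in range(200):
    w = w - eta * grad(w)    # plain gradient descent
    traj.append(w.copy())
traj = np.array(traj)

# Sharp direction: consecutive iterates alternate in sign
# (multiplier 1 - eta*h[0] = -0.9), e.g. 1.0, -0.9, 0.81, -0.729.
print(traj[:4, 0])

# A midpoint time-average smooths the oscillation away almost entirely.
avg = 0.5 * (traj[:-1] + traj[1:])
print(np.max(np.abs(avg[:, 0])))   # far smaller than the raw iterates

# Flat direction: the discrete iterates stay close to the gradient-flow
# solution w2(t) = exp(-h[1] * t) evaluated at time t = eta * step.
steps = np.arange(len(traj))
flow = np.exp(-h[1] * eta * steps)
print(np.max(np.abs(traj[:, 1] - flow)))   # small discretization error
```

The paper's central flows go much further (they handle the sustained, curvature-dependent oscillations of real networks and adaptive optimizers such as RMSProp), but the same basic picture applies: the raw iterates oscillate, while a smooth flow captures their average motion.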