How can we understand gradient-based training over non-convex landscapes? The edge of stability phenomenon, introduced in Cohen et al. (2021), indicates that the answer is not so simple: namely, gradient descent (GD) with large step sizes often diverges away from the gradient flow. In this regime, the "Central Flow", recently proposed in Cohen et al. (2025), provides an accurate ODE approximation to the GD dynamics across many architectures. In this work, we propose Rod Flow, an alternative ODE approximation, which carries the following advantages: (1) it rests on a principled derivation stemming from a physical picture of GD iterates as an extended one-dimensional object -- a "rod"; (2) it better captures GD dynamics on simple toy examples and matches the accuracy of Central Flow on representative neural network architectures; and (3) it is explicit and cheap to compute. Theoretically, we prove that Rod Flow correctly predicts the critical sharpness threshold and explains self-stabilization in quartic potentials. We validate our theory with a range of numerical experiments.
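As background for the critical sharpness threshold mentioned above, the following sketch (not from the paper; the function `gd_final_abs` and all parameter values are illustrative) demonstrates the classical stability condition for GD on a quadratic f(x) = (λ/2)x²: the update x ← x − ηλx contracts exactly when the sharpness λ is below 2/η, and diverges above it.

```python
# Illustrative sketch, assuming a 1-D quadratic f(x) = (lam / 2) * x**2.
# GD applies x <- x - eta * lam * x, which contracts iff |1 - eta*lam| < 1,
# i.e. iff the sharpness lam is below the threshold 2 / eta.

def gd_final_abs(lam: float, eta: float, x0: float = 1.0, steps: int = 200) -> float:
    """Run GD on f(x) = lam/2 * x^2 and return |x| after `steps` iterations."""
    x = x0
    for _ in range(steps):
        x -= eta * lam * x
    return abs(x)

eta = 0.1  # stability threshold is 2 / eta = 20
assert gd_final_abs(lam=19.0, eta=eta) < 1e-6  # sharpness below 2/eta: converges
assert gd_final_abs(lam=21.0, eta=eta) > 1e6   # sharpness above 2/eta: diverges
```

For a quadratic the threshold is exact; the edge-of-stability phenomenon concerns the non-convex case, where GD hovers near this threshold instead of simply diverging.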