Training modern neural networks often relies on large learning rates that place the optimizer at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime tends to yield better generalization, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) whose intrinsic dimension is smaller than that of the ambient parameter space. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the "sharpness dimension", and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
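To make the reference to Lyapunov dimension theory concrete, the sketch below computes the classical Kaplan-Yorke (Lyapunov) dimension from a list of Lyapunov exponents. It is not the sharpness dimension, whose definition is not given in this abstract. The helper `gd_local_exponents` is a hypothetical illustration that assumes a plain gradient descent step on a quadratic loss, for which the Jacobian of the update is I - lr * H and the one-step expansion rates are log|1 - lr * h_i| for the Hessian eigenvalues h_i. The function names and example spectra are assumptions chosen for illustration only.

```python
import math


def kaplan_yorke_dimension(exponents):
    """Classical Kaplan-Yorke dimension from a list of Lyapunov exponents.

    With exponents sorted in decreasing order, take the largest j whose
    partial sum is non-negative, then interpolate linearly into exponent j+1.
    """
    lams = sorted(exponents, reverse=True)
    partial = 0.0
    for j, lam in enumerate(lams):
        if partial + lam < 0:
            # The first j exponents have a non-negative sum; adding the next
            # one makes the sum negative, so the dimension lies in [j, j+1).
            return j + partial / abs(lam)
        partial += lam
    # Every partial sum is non-negative: the dimension saturates at n.
    return float(len(lams))


def gd_local_exponents(hessian_eigs, lr):
    """Hypothetical helper: one-step expansion rates of gradient descent.

    Assumes a quadratic loss, so the update Jacobian is I - lr * H and its
    singular values are |1 - lr * h| for each Hessian eigenvalue h.
    """
    rates = []
    for h in hessian_eigs:
        s = abs(1.0 - lr * h)
        rates.append(math.log(s) if s > 0 else -math.inf)
    return rates


# Worked example with hand-picked exponents.
print(kaplan_yorke_dimension([0.5, -0.25, -1.0]))  # 2.25

# Illustrative spectrum: one direction is unstable because lr * h > 2.
print(kaplan_yorke_dimension(gd_local_exponents([2.5, 1.0, 0.1], lr=0.9)))
```

The partial sums inside `kaplan_yorke_dimension` are logarithms of products of the largest singular values, so the result depends on the whole ordered spectrum rather than on the largest eigenvalue or the trace alone. Whether these products correspond to the partial determinants mentioned above is an assumption of this sketch.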