Beyond the Edge of Stability via Two-step Gradient Updates

Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a "bona fide" discretisation of an underlying gradient flow. Yet many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability" (EoS), where the step size exceeds the admissibility threshold, which is inversely proportional to that Lipschitz constant. Perhaps surprisingly, GD has been empirically observed to still converge despite local instability and oscillatory behaviour. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a "Sharpness-Minimisation" implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple yet representative learning problems, via analysis of two-step gradient updates. Specifically, we characterise a local condition involving third-order derivatives that guarantees existence of, and convergence to, fixed points of the two-step updates, and we leverage this property in a teacher-student setting under population loss. Finally, starting from Matrix Factorization, we provide observations of period-2 orbits of GD in high-dimensional settings, with intuition about their dynamics, along with an exploration of more general settings.
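
To make the notion of a fixed point of the two-step update concrete, the following is a minimal sketch on a toy one-dimensional loss. The loss f(x) = (x^2 - 1)^2 / 4, the step size, the initial point and all function names are illustrative assumptions and are not taken from the paper. At the minimiser x = 1 the curvature is f''(1) = 2, so the classical stability threshold is 2 / f''(1) = 1. With a step size slightly above it, GD no longer settles at x = 1 but, in this example, settles into an oscillation between two points on either side of it.

```python
# Toy illustration (assumed example, not from the paper):
# GD on f(x) = (x**2 - 1)**2 / 4 with a step size just above 2 / f''(1) = 1.

def grad(x):
    # Derivative of the toy loss f(x) = (x**2 - 1)**2 / 4.
    return x**3 - x

def gd_step(x, eta):
    # One plain gradient-descent update.
    return x - eta * grad(x)

def run_gd(x0, eta, steps):
    # Return the full trajectory [x0, x1, ..., x_steps].
    xs = [x0]
    for _ in range(steps):
        xs.append(gd_step(xs[-1], eta))
    return xs

eta = 1.05                      # slightly beyond the stability threshold 1.0
trajectory = run_gd(1.1, eta, 2000)
a, b = trajectory[-2], trajectory[-1]

# a is (numerically) a fixed point of the two-step map but not of the one-step map.
two_step_residual = abs(gd_step(gd_step(a, eta), eta) - a)
one_step_gap = abs(b - a)

print("last two iterates:", a, b)
print("two-step residual:", two_step_residual)
print("one-step gap:", one_step_gap)
```

For this particular cubic gradient, a short calculation shows that the two points of the orbit satisfy a * b = 1 / eta, which gives a simple sanity check on the numerical output.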
