On the Stability of Gradient Descent for Large Learning Rate

There is currently significant interest in understanding the Edge of Stability (EoS) phenomenon observed in neural network training. EoS is characterized by a non-monotonic decrease of the loss function over epochs, while the sharpness of the loss (the spectral norm of the Hessian) progressively approaches and stabilizes around 2/(learning rate). Two reasons for the existence of EoS when training with gradient descent have recently been proposed: a lack of flat minima near the gradient descent trajectory, together with the presence of compact forward-invariant sets. In this paper, we show that linear neural networks optimized under a quadratic loss function satisfy the first assumption and also a necessary condition for the second assumption. More precisely, we prove that the gradient descent map is non-singular, the set of global minimizers of the loss function forms a smooth manifold, and the stable minima form a bounded subset in parameter space. Additionally, we prove that if the step size is too large, then the set of initializations from which gradient descent converges to a critical point has measure zero.
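The 2/(learning rate) threshold can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes the smallest possible linear network, f(x) = a·b·x with scalar weights, fitted to the single target y = 1 under the quadratic loss 0.5·(a·b − y)². The function names (`loss`, `gd_step`, `sharpness_at_minimum`, `run_gd`) and the chosen step size 0.5 are hypothetical, introduced only for illustration. In this toy model the global minimizers form the hyperbola a·b = 1, the sharpness at a minimizer is a² + b², and the minimizers with a² + b² < 2/(learning rate) lie in a bounded region, which mirrors the structure described in the abstract.

```python
# Toy model (an assumption for illustration, not the paper's general setting):
# a depth-2 scalar linear network f(x) = a * b * x fitted to y = 1 at x = 1.

def loss(a, b, y=1.0):
    """Quadratic loss 0.5 * (a*b - y)^2."""
    return 0.5 * (a * b - y) ** 2


def gd_step(a, b, lr, y=1.0):
    """One gradient descent step; the gradient is (r*b, r*a) with r = a*b - y."""
    r = a * b - y
    return a - lr * r * b, b - lr * r * a


def sharpness_at_minimum(a, b):
    """Largest Hessian eigenvalue at a global minimizer (a*b == y).

    There the Hessian is [[b^2, a*b], [a*b, a^2]], a rank-one matrix whose
    only nonzero eigenvalue equals its trace a^2 + b^2.
    """
    return a * a + b * b


def run_gd(a, b, lr, steps, y=1.0):
    """Return the loss values along the trajectory (length steps + 1)."""
    losses = [loss(a, b, y)]
    for _ in range(steps):
        a, b = gd_step(a, b, lr, y)
        losses.append(loss(a, b, y))
    return losses


if __name__ == "__main__":
    lr = 0.5  # the sharpness threshold is 2 / lr = 4
    # Start near the minimizer (1, 1), whose sharpness 2 is below the threshold.
    flat = run_gd(1.1, 1.0, lr, steps=50)
    # Start near the minimizer (3, 1/3), whose sharpness (about 9.1) exceeds it.
    sharp = run_gd(3.0, 1.0 / 3.0 + 0.01, lr, steps=5)
    print("near flat minimum: ", flat[0], "->", flat[-1])
    print("near sharp minimum:", sharp[0], "->", sharp[1])
```

Started near the flat minimizer, the loss contracts toward zero. Started near the sharp minimizer, the very first step increases the loss, because the residual a·b − y is multiplied by roughly 1 − lr·(a² + b²), whose magnitude exceeds 1 once the sharpness is above 2/lr.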


