There is currently significant interest in understanding the Edge of Stability (EoS) phenomenon observed in neural network training. It is characterized by a non-monotonic decrease of the loss function over epochs, while the sharpness of the loss (the spectral norm of the Hessian) progressively approaches and then stabilizes around 2/(learning rate). Two conditions have recently been proposed to explain why EoS arises under gradient descent: a lack of flat minima near the gradient descent trajectory, and the presence of compact forward-invariant sets. In this paper, we show that linear neural networks optimized under a quadratic loss function satisfy the first assumption, as well as a necessary condition for the second. More precisely, we prove that the gradient descent map is non-singular, that the set of global minimizers of the loss function forms a smooth manifold, and that the stable minima form a bounded subset of parameter space. Additionally, we prove that if the step size is too large, then the set of initializations from which gradient descent converges to a critical point has measure zero.
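To make the boundedness claim concrete, here is a minimal sketch on a toy case that is not taken from the paper: a depth-two scalar linear network f(a, b) = ab fitted to a single target y under the quadratic loss ½(ab − y)². The function names (`loss`, `gd_step`, `sharpness`, `is_linearly_stable`, `run_gd`) and all parameter values are assumptions chosen for illustration. In this toy case the global minimizers form the hyperbola ab = y, the sharpness at a minimizer equals a² + b², and the usual linear stability criterion (sharpness below 2/(learning rate)) therefore selects a bounded portion of that curve.

```python
import numpy as np


def loss(a, b, y=1.0):
    """Quadratic loss of the scalar two-layer linear network f(a, b) = a * b."""
    return 0.5 * (a * b - y) ** 2


def gd_step(a, b, lr, y=1.0):
    """One gradient descent step on loss(a, b)."""
    r = a * b - y  # residual
    return a - lr * r * b, b - lr * r * a


def sharpness(a, b, y=1.0):
    """Spectral norm of the Hessian of the loss at (a, b)."""
    off_diag = 2.0 * a * b - y  # d^2 L / (da db)
    hessian = np.array([[b * b, off_diag], [off_diag, a * a]])
    return np.linalg.norm(hessian, 2)


def is_linearly_stable(a, b, lr, y=1.0):
    """Linear stability test at a minimizer: sharpness < 2 / lr."""
    return sharpness(a, b, y) < 2.0 / lr


def run_gd(a, b, lr, steps, y=1.0):
    """Iterate gradient descent for a fixed number of steps."""
    for _ in range(steps):
        a, b = gd_step(a, b, lr, y)
    return a, b
```

With a learning rate of 0.1 the threshold is 20. The minimizer (2, 0.5) has sharpness 4.25 and attracts nearby iterates, whereas the minimizer (5, 0.2) has sharpness 25.04, so a small perturbation away from it is amplified by the next gradient step. Since a² + b² < 20 describes a disc, the stable minimizers of this toy problem lie in a bounded set, in line with the general statement of the abstract.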