Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases, including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Although significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions they occur. This paper takes an initial step toward answering this question by showing that these implicit biases are in fact different tips of the same iceberg. They occur when the objective function has good regularity, which, combined with a provable preference of large-learning-rate gradient descent for moving toward flatter regions, results in these nontrivial dynamical phenomena. To establish this result, we develop a new global convergence theory under large learning rates for a family of nonconvex functions without a globally Lipschitz continuous gradient, even though global Lipschitz continuity of the gradient is typically assumed in existing convergence analyses. A byproduct is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent on nonconvex functions. We also validate our theory with experiments on neural networks, where different losses, activation functions, and batch normalization can all significantly affect regularity and lead to very different training dynamics.
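As a rough illustration of why a large learning rate can favor flatter regions, the sketch below runs gradient descent on two one-dimensional quadratics that differ only in curvature. It is not the analysis developed in this paper: it only shows the classical linear-stability threshold, under which gradient descent on f(x) = 0.5 * c * x^2 contracts when the learning rate is below 2/c and expands when it is above. The function name `gd_quadratic`, the curvature values, and the learning rate are all assumptions chosen for illustration.

```python
def gd_quadratic(curvature, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return the final |x|."""
    x = x0
    for _ in range(steps):
        grad = curvature * x      # f'(x) = curvature * x
        x = x - lr * grad         # update: x <- (1 - lr * curvature) * x
    return abs(x)


lr = 1.5  # hypothetical "large" learning rate

# Flat minimum: 2 / curvature = 2.0 > lr, so the iterates contract toward 0.
flat = gd_quadratic(curvature=1.0, lr=lr)

# Sharp minimum: 2 / curvature = 0.5 < lr, so the iterates are pushed away.
sharp = gd_quadratic(curvature=4.0, lr=lr)

print(f"flat minimum:  |x| after 50 steps = {flat:.3e}")
print(f"sharp minimum: |x| after 50 steps = {sharp:.3e}")
```

With the same learning rate, the iterate settles at the flat minimum but cannot stay near the sharp one. On a nonconvex objective with good regularity, the abstract's claim is that this kind of instability moves the iterates toward flatter regions rather than causing divergence.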