Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases, including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Although significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions they are more likely to occur. This paper takes an initial step toward answering this question and also shows that these implicit biases are in fact various tips of the same iceberg. To establish these results, we develop a global convergence theory under large learning rates for a family of nonconvex functions without globally Lipschitz continuous gradients, a condition typically assumed in existing convergence analyses. Specifically, these phenomena are more likely to occur when the optimization objective function has good regularity. This regularity, together with the preference of large-learning-rate gradient descent for flatter regions, results in these nontrivial dynamical behaviors. Another corollary is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. Although our theory so far applies only to specific functions, the possibility of extrapolating it to neural networks is also validated experimentally; there, different choices of loss, activation functions, and other techniques such as batch normalization can all affect regularity significantly and lead to very different training dynamics.
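The balancing effect and the preference for flatter regions described above can be illustrated numerically. The sketch below is an assumption made for illustration, not the function family or the experiments of the paper: it runs plain gradient descent on the toy objective f(x, y) = (xy - 1)^2 / 2, whose gradient is not globally Lipschitz, from a hypothetical unbalanced starting point (3, 0.1) with two hypothetical learning rates, 0.01 and 0.3. The helper names `gradient_descent` and `summarize` are likewise invented for this example.

```python
def loss(x, y):
    # Toy nonconvex objective (an illustrative assumption, not taken from the paper).
    return 0.5 * (x * y - 1.0) ** 2


def gradient_descent(x, y, lr, steps=2000):
    # Plain gradient descent with a constant learning rate.
    for _ in range(steps):
        r = x * y - 1.0                           # residual xy - 1
        x, y = x - lr * r * y, y - lr * r * x     # simultaneous update of both coordinates
    return x, y


def summarize(x, y):
    return {
        "loss": loss(x, y),
        "imbalance": abs(x * x - y * y),   # zero when |x| = |y|
        "sharpness": x * x + y * y,        # largest Hessian eigenvalue at a minimizer with xy = 1
    }


x0, y0 = 3.0, 0.1                          # hypothetical unbalanced initialization
small = summarize(*gradient_descent(x0, y0, lr=0.01))
large = summarize(*gradient_descent(x0, y0, lr=0.3))

print("small learning rate:", small)
print("large learning rate:", large)
```

For this toy objective, one gradient step multiplies x^2 - y^2 by exactly 1 - lr^2 (xy - 1)^2, so the imbalance shrinks noticeably only while the product of the learning rate and the residual is not small. With the small learning rate the residual decays quickly and the iterate settles near the sharp minimizer closest to the start; with the large learning rate the iterate oscillates first, the imbalance contracts, and the run ends at a flatter, more balanced minimizer.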