Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge this traditional view by empirically demonstrating that Gradient Descent (GD) converges even when the step-size $\eta$ exceeds the threshold $2/L$, where $L$ is the global smoothness constant. This behavior is commonly known as the Edge of Stability (EoS) phenomenon. A widely held belief is that an objective function with subquadratic growth plays an important role in triggering EoS. In this paper, we provide a more comprehensive answer by considering the task of finding a linear interpolator $\beta \in R^{d}$ for regression with loss function $l(\cdot)$, where $\beta$ admits the parameterization $\beta = w^2_{+} - w^2_{-}$. Contrary to previous work suggesting that a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic, under proper conditions. We make this argument rigorous with both empirical and theoretical evidence, demonstrating that the GD trajectory converges to a linear interpolator in a non-asymptotic manner. Moreover, the model with quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored in the EoS regime. Our analysis thus sheds new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS for more practical models.
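To make the setting concrete, the following is a minimal illustrative sketch (not the paper's actual experiment) of GD on the overparameterized regression $\beta = w^2_{+} - w^2_{-}$ with a quadratic loss, tracking the sharpness $\lambda_{\max}(\nabla^2 \mathcal{L})$ against the stability threshold $2/\eta$; the data generation, dimensions, initialization, and step-size are all assumptions chosen only for illustration.

```python
# Illustrative sketch of GD on beta = w_+^2 - w_-^2 with quadratic loss,
# monitoring sharpness (largest Hessian eigenvalue) versus 2/eta.
# Data, initialization, and eta are arbitrary assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 30                                # n < d: many linear interpolators exist
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)
S = X.T @ X / n                              # reappears in the Hessian blocks

def beta(wp, wm):
    return wp**2 - wm**2

def loss(wp, wm):
    r = X @ beta(wp, wm) - y
    return 0.5 * np.mean(r**2)               # quadratic loss l

def sharpness(wp, wm):
    # Largest eigenvalue of the Hessian of the loss in (w_+, w_-).
    g = X.T @ (X @ beta(wp, wm) - y) / n     # gradient w.r.t. beta
    Dp, Dm = np.diag(wp), np.diag(wm)
    H = np.block([
        [ 2 * np.diag(g) + 4 * Dp @ S @ Dp, -4 * Dp @ S @ Dm],
        [-4 * Dm @ S @ Dp,                  -2 * np.diag(g) + 4 * Dm @ S @ Dm],
    ])
    return np.linalg.eigvalsh(H)[-1]

wp = np.full(d, 0.7)                         # illustrative initialization
wm = np.full(d, 0.7)
eta = 1.5 / sharpness(wp, wm)                # stable at t = 0; whether the run later
                                             # enters the EoS regime depends on the instance
for t in range(3001):
    g = X.T @ (X @ beta(wp, wm) - y) / n
    # chain rule through the parameterization beta = w_+^2 - w_-^2
    wp, wm = wp - eta * 2 * wp * g, wm + eta * 2 * wm * g
    if t % 500 == 0:
        print(f"t={t:5d}  loss={loss(wp, wm):.3e}  "
              f"sharpness={sharpness(wp, wm):.2f}  2/eta={2 / eta:.2f}")
```

A run is in the EoS regime whenever the printed sharpness hovers at or above $2/\eta$ while the loss keeps decreasing; since the parameterized loss is quartic in $(w_{+}, w_{-})$, it has no global smoothness constant, so the sketch compares against the sharpness along the trajectory rather than a fixed $L$.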