Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen et al., 2021], where the stepsizes are large enough that the loss along the GD iterates is non-monotonic. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime. Despite the presence of local oscillations, we prove that GD with \emph{any} constant stepsize minimizes the logistic loss over a long time scale. Furthermore, we prove that with \emph{any} constant stepsize, the GD iterates tend to infinity when projected onto a max-margin direction (the hard-margin SVM direction) and converge to a fixed vector that minimizes a strongly convex potential when projected onto the orthogonal complement of the max-margin direction. In contrast, we also show that in the EoS regime, GD iterates may diverge catastrophically under the exponential loss, highlighting the superiority of the logistic loss. These theoretical findings are in line with numerical simulations and complement existing theories on the convergence and implicit bias of GD for logistic regression, which apply only when the stepsizes are sufficiently small.
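As an informal illustration of the setting described above (not the paper's experiments), the following minimal sketch runs constant-stepsize GD on the logistic loss for a two-point, linearly separable toy problem. The dataset `Z`, the stepsize `eta=10.0`, the step count, and all function names are assumptions chosen for illustration. The labels are folded into the features, so each row is y_i x_i, and the data are built so that (1, 0) is the max-margin direction.

```python
import math

# Hypothetical toy data: each row is y_i * x_i (label folded into the feature).
# Both points have margin 1 along (1, 0), the max-margin direction for this set.
Z = [(1.0, 4.0), (1.0, -2.0)]


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def sigmoid(t):
    """Numerically stable 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def margins(w, data):
    """Inner products <w, z_i> for every data point."""
    return [w[0] * z[0] + w[1] * z[1] for z in data]


def logistic_loss(w, data):
    """Average logistic loss (1/n) * sum_i log(1 + exp(-<w, z_i>))."""
    return sum(softplus(-m) for m in margins(w, data)) / len(data)


def gd_step(w, data, eta):
    """One GD step; the gradient is -(1/n) * sum_i sigmoid(-m_i) * z_i."""
    n = len(data)
    g0 = g1 = 0.0
    for z, m in zip(data, margins(w, data)):
        s = sigmoid(-m)
        g0 -= s * z[0] / n
        g1 -= s * z[1] / n
    return (w[0] - eta * g0, w[1] - eta * g1)


def run_gd(data, eta, steps):
    """Run GD from the origin and record the loss at every iterate."""
    w = (0.0, 0.0)
    losses = [logistic_loss(w, data)]
    for _ in range(steps):
        w = gd_step(w, data, eta)
        losses.append(logistic_loss(w, data))
    return w, losses


# Assumed illustrative settings: a deliberately large stepsize and 200 steps.
w_final, losses = run_gd(Z, eta=10.0, steps=200)
print("first losses:", [round(v, 4) for v in losses[:6]])
print("final loss:", losses[-1])
print("final iterate:", w_final)
```

On this toy problem the loss rises above its initial value during the first few steps before dropping well below it, which is the kind of non-monotonic behavior the abstract attributes to large stepsizes. The sketch is only meant to make the setup concrete; it does not reproduce the paper's analysis of the orthogonal component or of the exponential loss.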