Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen et al., 2021], where the stepsizes are large enough that the loss along the GD iterates is non-monotonic. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime. Despite the presence of local oscillations, we prove that GD with any constant stepsize minimizes the logistic loss over a long time scale. Furthermore, we prove that with any constant stepsize, the GD iterates tend to infinity when projected onto a max-margin direction (the hard-margin SVM direction), and converge to a fixed vector that minimizes a strongly convex potential when projected onto the orthogonal complement of the max-margin direction. In contrast, we show that in the EoS regime, GD iterates may diverge catastrophically under the exponential loss, highlighting the superiority of the logistic loss. These theoretical findings agree with numerical simulations and complement existing theories on the convergence and implicit bias of GD, which apply only when the stepsizes are sufficiently small.
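The contrast between the two losses can be sketched numerically. The snippet below is a minimal illustration, not the paper's experiment: the two-point dataset `Z`, the stepsize `eta = 20`, the zero initialization, and all function names are assumptions chosen for this example. It runs constant-stepsize GD on the same linearly separable data under the logistic loss and under the exponential loss.

```python
import numpy as np

# Hypothetical toy data (not from the paper): each row is z_i = y_i * x_i.
# The set is linearly separable: w = (1, 0) gives margin 1 on both points.
Z = np.array([[1.0, 4.0],
              [1.0, -1.0]])


def logistic_loss_and_grad(w):
    m = Z @ w                              # margins y_i * <x_i, w>
    loss = np.logaddexp(0.0, -m).mean()    # mean of log(1 + exp(-m_i)), stable
    s = np.exp(-np.logaddexp(0.0, m))      # sigmoid(-m_i), stable
    grad = -(Z * s[:, None]).mean(axis=0)
    return loss, grad


def exponential_loss_and_grad(w):
    m = Z @ w
    e = np.exp(-m)                         # may overflow for very negative margins
    return e.mean(), -(Z * e[:, None]).mean(axis=0)


def run_gd(loss_and_grad, eta, steps):
    """Constant-stepsize GD from w = 0; returns the last iterate and all losses."""
    w = np.zeros(Z.shape[1])
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(w)
        losses.append(loss)
        w = w - eta * grad
    losses.append(loss_and_grad(w)[0])
    return w, np.array(losses)


eta = 20.0  # assumed "large" stepsize for this toy problem
w_log, logistic_losses = run_gd(logistic_loss_and_grad, eta, steps=200)

# Overflow and NaN are expected here, so silence the floating-point warnings.
with np.errstate(over="ignore", invalid="ignore"):
    w_exp, exp_losses = run_gd(exponential_loss_and_grad, eta, steps=10)

print("logistic loss, first steps:", logistic_losses[:4])
print("logistic loss, final:", logistic_losses[-1])
print("exponential loss, first steps:", exp_losses[:4])
```

On this toy problem the logistic loss rises on the first step and then falls well below its initial value, while the exponential loss overflows within a few steps. This is only one hand-picked instance and does not by itself demonstrate the general results stated above.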