We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
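For concreteness, the setting can be sketched as follows (a minimal formalization; the data $(x_i, y_i)_{i=1}^{n}$, the iterates $w_t$, and the averaged-loss form are standard notation assumed here, not fixed by the abstract): GD with a constant stepsize $\eta$ minimizes the logistic loss over linearly separable data,
\[
\mathcal{L}(w) \;=\; \frac{1}{n}\sum_{i=1}^{n}\log\!\bigl(1+\exp(-y_i\, x_i^{\top} w)\bigr),
\qquad
w_{t+1} \;=\; w_t - \eta\,\nabla \mathcal{L}(w_t),
\]
and the accelerated $\tilde{\mathcal{O}}(1/T^2)$ loss is obtained by choosing $\eta = \Theta(T)$ for a given budget of $T$ steps.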