We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, in $\mathcal{O}(\eta)$ steps, and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
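The setting can be tried numerically with a minimal sketch such as the one below. It is an illustration under stated assumptions, not the authors' experimental code: the synthetic dataset, the helper names (`make_separable_data`, `run_gd`), the zero initialization, and the concrete choice $\eta = T$ (one instance of $\eta = \Theta(T)$ with the constant set to 1) are all hypothetical. Whether the loss visibly oscillates on a given dataset depends on the data and the margin.

```python
import numpy as np


def make_separable_data(n=50, d=5, margin=0.1, seed=0):
    """Synthetic linearly separable data (an assumption made for illustration)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = np.where(X[:, 0] >= 0, 1.0, -1.0)
    X[:, 0] += margin * y  # push points away from the hyperplane x_0 = 0
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm rows
    return X, y


def logistic_loss(w, X, y):
    """Empirical logistic loss: mean of log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))


def logistic_grad(w, X, y):
    """Gradient of the empirical logistic loss."""
    margins = y * (X @ w)
    # sigmoid(-margin), computed stably as exp(-log(1 + exp(margin)))
    weights = np.exp(-np.logaddexp(0.0, margins))
    return -(X * (weights * y)[:, None]).mean(axis=0)


def run_gd(X, y, eta, num_steps):
    """Constant-stepsize GD from w = 0; returns the final iterate and all losses."""
    w = np.zeros(X.shape[1])
    losses = [logistic_loss(w, X, y)]
    for _ in range(num_steps):
        w = w - eta * logistic_grad(w, X, y)
        losses.append(logistic_loss(w, X, y))
    return w, np.array(losses)


if __name__ == "__main__":
    X, y = make_separable_data()
    T = 200  # step budget
    for eta in (1.0, float(T)):  # small stepsize vs. eta = T (assumed constant 1)
        _, losses = run_gd(X, y, eta=eta, num_steps=T)
        increases = int(np.sum(np.diff(losses) > 0))
        print(f"eta={eta:g}: final loss={losses[-1]:.3e}, "
              f"steps where loss increased={increases}")
```

With unit-norm rows the loss is $1/4$-smooth, so $\eta = 1$ lies in the classical regime where the loss decreases monotonically. The second run uses a stepsize far outside that regime, which is the case the abstract addresses.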