We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property that no longer holds when the problem is not separable. In fact, the behaviour can be much more complex -- a sequence of period-doubling bifurcations begins at the critical step size $2/\lambda$, where $\lambda$ is the largest eigenvalue of the Hessian at the solution. Using a smaller-than-critical step size guarantees convergence if initialized near the solution, but does this suffice globally? In one dimension, we show that a step size less than $1/\lambda$ suffices for global convergence. However, for all step sizes between $1/\lambda$ and the critical step size $2/\lambda$, one can construct a dataset such that GD converges to a stable cycle. In higher dimensions, this is possible even for step sizes less than $1/\lambda$. Our results show that although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and GD may instead converge to a cycle depending on the initialization.
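The bifurcation at $2/\lambda$ can be seen numerically in a minimal sketch (an illustrative construction, not one of the paper's datasets): place two points at $x = 1$ with opposite labels, so the problem is non-separable. The averaged logistic loss reduces to $\ell(w) = \tfrac{1}{2}[\log(1+e^{-w}) + \log(1+e^{w})]$, with gradient $\ell'(w) = \sigma(w) - \tfrac{1}{2}$ and Hessian $\lambda = \sigma'(0) = \tfrac{1}{4}$ at the minimizer $w^* = 0$, giving a critical step size of $2/\lambda = 8$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gd(step, w0=0.5, iters=1000):
    """Run GD on l(w) = (1/2)[log(1+e^{-w}) + log(1+e^{w})],
    i.e. logistic regression on x = 1 with labels +1 and -1.
    The gradient is sigmoid(w) - 1/2; the minimizer is w* = 0."""
    w = w0
    history = [w]
    for _ in range(iters):
        w = w - step * (sigmoid(w) - 0.5)
        history.append(w)
    return history

# Below the critical step size 2/lambda = 8, the iterates oscillate
# but contract toward w* = 0.
below = gd(step=7.0)
print(below[-1])

# Above it, GD settles into a stable period-2 cycle alternating
# between +a and -a for some a > 0 (here a is roughly 1.2).
above = gd(step=9.0)
print(above[-2], above[-1])
```

Step sizes in this range converge locally at $w^* = 0$ (since $|1 - \eta\lambda| < 1$ for $\eta < 8$), matching the abstract's local guarantee; the cycling datasets for step sizes in $(1/\lambda, 2/\lambda)$ require a more careful construction than this symmetric example.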