We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property that no longer holds when the problem is not separable. In fact, the behaviour can be much more complex: a sequence of period-doubling bifurcations begins at the critical step size $2/\lambda$, where $\lambda$ is the largest eigenvalue of the Hessian at the solution. Using a smaller-than-critical step size guarantees convergence if initialized near the solution, but does this suffice globally? In one dimension, we show that a step size less than $1/\lambda$ suffices for global convergence. However, for all step sizes between $1/\lambda$ and the critical step size $2/\lambda$, one can construct a dataset such that GD converges to a stable cycle. In higher dimensions, this is possible even for step sizes less than $1/\lambda$. Our results show that although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and GD may instead converge to a cycle depending on the initialization.
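As a quick numerical illustration of the onset of period doubling at $2/\lambda$, the following is a minimal sketch in one dimension. The two-point non-separable dataset, the Newton-based computation of $w^\ast$ and $\lambda$, and the step-size multipliers are illustrative choices for this sketch, not constructions from the paper.

```python
import numpy as np

# Minimal sketch of the period-doubling onset at step size 2/lambda.
# Toy non-separable 1D dataset via the margins z_i = y_i * x_i; mixed signs
# mean no w classifies every point correctly. Values are illustrative only.
z = np.array([2.0, -1.0])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad(w):
    # L(w) = mean_i log(1 + exp(-z_i w));  L'(w) = -mean_i z_i * sigmoid(-z_i w)
    return -np.mean(z * sigmoid(-z * w))

def hess(w):
    # L''(w) = mean_i z_i^2 * sigmoid(z_i w) * sigmoid(-z_i w) > 0
    return np.mean(z**2 * sigmoid(z * w) * sigmoid(-z * w))

# Locate the minimizer w* by Newton's method (the 1D problem is strictly
# convex with positive curvature), then read off lambda = L''(w*).
w_star = 0.0
for _ in range(50):
    w_star -= grad(w_star) / hess(w_star)
lam = hess(w_star)

def gd_tail(eta, w0=0.0, burn_in=5000, tail=4):
    # Run GD long enough to reach the attractor, then report the next iterates.
    w = w0
    for _ in range(burn_in):
        w -= eta * grad(w)
    out = []
    for _ in range(tail):
        w -= eta * grad(w)
        out.append(w)
    return out

for factor in (1.9, 2.05):  # just below and just above the critical step size
    print(f"eta = {factor}/lambda:", np.round(gd_tail(factor / lam), 4))
# Below 2/lambda the tail repeats w*; just above it, the iterates alternate
# between two values, i.e. GD settles into a stable 2-cycle (empirically,
# for this particular toy dataset).
```

In higher dimensions the same experiment would use the largest eigenvalue of the Hessian at the minimizer in place of the scalar second derivative; the abstract's point is that there, cycling can occur even below $1/\lambda$.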