A vast literature now exists on convergence guarantees for gradient descent and its derived methods. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge from any initialization? We provide fundamental impossibility results showing that, once the step size is too large, convergence fails regardless of the initialization. Examining the asymptotic value of the gradient norm along the optimization trajectory, we observe a phase transition as the step size crosses a critical value. Practitioners have observed this behavior, yet the mechanism behind it has remained unclear beyond heuristics. Using results from dynamical systems theory, we prove this phase transition for linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity of the gradient. We validate our findings through experiments with non-linear networks.
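The following is a minimal illustrative sketch, not the paper's experimental setup: it runs gradient descent with a fixed step size on a toy two-layer linear network with a squared loss and reports the asymptotic gradient norm for several step sizes, to show the kind of phase transition the abstract describes. The data, network widths, step-size grid, and iteration budget are all assumptions chosen for illustration.

```python
# Illustrative sketch (assumed toy data and sizes): fixed-step-size gradient
# descent on a two-layer linear network with squared loss. Below a critical
# step size the gradient norm goes to ~0; above it, the iterates blow up.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))            # toy inputs (assumed)
y = X @ rng.normal(size=5)              # toy targets from a linear teacher (assumed)

def loss_and_grads(W1, W2):
    """Squared loss of the linear network x -> W2 @ W1 @ x and its gradients."""
    residual = X @ W1.T @ W2.T - y[:, None]   # shape (n, 1)
    loss = 0.5 * np.sum(residual ** 2)
    gW2 = residual.T @ (X @ W1.T)             # dL/dW2, shape (1, h)
    gW1 = W2.T @ residual.T @ X               # dL/dW1, shape (h, d)
    return loss, gW1, gW2

def asymptotic_grad_norm(step_size, n_steps=5000):
    """Run plain gradient descent and return the gradient norm after n_steps."""
    W1 = rng.normal(size=(4, 5)) * 0.1        # hidden width 4 (arbitrary choice)
    W2 = rng.normal(size=(1, 4)) * 0.1
    for _ in range(n_steps):
        _, gW1, gW2 = loss_and_grads(W1, W2)
        W1 = W1 - step_size * gW1
        W2 = W2 - step_size * gW2
        if np.abs(W1).max() > 1e50 or np.abs(W2).max() > 1e50:
            return np.inf                     # trajectory diverged
    _, gW1, gW2 = loss_and_grads(W1, W2)
    return np.sqrt(np.sum(gW1 ** 2) + np.sum(gW2 ** 2))

# Step sizes chosen only to straddle the (problem-dependent) critical value.
for eta in [1e-3, 5e-3, 1e-2, 2e-2, 5e-2]:
    print(f"step size {eta:.0e}: asymptotic gradient norm {asymptotic_grad_norm(eta):.3e}")
```

In this toy setting, the critical step size depends on the data and initialization; the sketch only visualizes the transition, whereas the paper characterizes it rigorously for linear networks with a squared loss.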