We study the dynamics of gradient descent with large-order constant step-sizes in quadratic regression models. Within this framework, we show that the dynamics are captured by a specific cubic map, naturally parameterized by the step-size. Through a fine-grained bifurcation analysis in the step-size parameter, we delineate five distinct training phases: (1) monotonic, (2) catapult, (3) periodic, (4) chaotic, and (5) divergent, and we precisely demarcate the boundaries of each phase. As illustrations, we provide examples involving phase retrieval and two-layer neural networks with quadratic activation functions and constant outer layers, trained on orthogonal data. Our simulations indicate that these five phases also appear with generic non-orthogonal data. We also empirically investigate the generalization performance when training in the non-monotonic (and non-divergent) phases. In particular, we observe that ergodic trajectory averaging stabilizes the test error in these phases.
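To make the cubic-map claim concrete, here is a minimal sketch on a toy one-parameter problem. It is not the map derived in the paper: the loss L(w) = (w^2 - y)^2 / 4, the target `y = 1`, the initial point `w0 = 0.5`, the blow-up threshold, and the step-size values are all assumptions chosen for illustration. For this loss the gradient step is w - eta * w * (w^2 - y), which is cubic in w and parameterized by the step-size eta.

```python
def loss(w, y=1.0):
    """Toy quadratic-regression loss L(w) = (w^2 - y)^2 / 4 (assumed, not from the paper)."""
    return 0.25 * (w * w - y) ** 2


def gd_step(w, eta, y=1.0):
    """One gradient-descent step; the update is a cubic polynomial in w."""
    return w - eta * w * (w * w - y)


def run(eta, w0=0.5, steps=200, blowup=1e6):
    """Iterate the map and record the loss; stop early if |w| exceeds `blowup`."""
    w = w0
    losses = [loss(w)]
    for _ in range(steps):
        w = gd_step(w, eta)
        if abs(w) > blowup:
            return w, losses, True  # treated as divergence in this sketch
        losses.append(loss(w))
    return w, losses, False


if __name__ == "__main__":
    # Step-sizes are arbitrary probes; no phase boundary is asserted here.
    for eta in (0.1, 0.7, 1.2, 3.0):
        w, losses, diverged = run(eta)
        status = "diverged" if diverged else f"final loss {losses[-1]:.3e}"
        print(f"eta={eta}: {status}, max loss seen {max(losses):.3e}")
```

Sweeping `eta` over a fine grid and plotting the tail of each trajectory against `eta` would give a bifurcation diagram for this toy map. The phase boundaries it shows belong to the toy problem and need not match those established in the paper.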