Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the rate of decrease of the empirical risk is non-monotone, even after averaging over large batches. Long plateaus with barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically "simpler" or "easier to learn", although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.
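
For concreteness, the setting described above can be written out as follows. This is only a sketch under assumed notation: the symbols $\varphi$, $u_*$, $\sigma$, $a_i$, $u_i$, $m$, $\varepsilon$, $F$, $G$, the squared loss, and the $1/m$ normalization of the network are illustrative choices, not taken from the abstract.

```latex
% Single-index data (assumed notation): the response depends on the
% covariates x only through the one-dimensional projection <u_*, x>.
y = \varphi(\langle u_*, x \rangle) + \text{noise},
\qquad x \in \mathbb{R}^d, \quad \|u_*\|_2 = 1 .

% Two-layer network of width m (the 1/m scaling is an assumption).
f(x; a, U) = \frac{1}{m} \sum_{i=1}^{m} a_i \, \sigma(\langle u_i, x \rangle) .

% Population risk (squared loss assumed) and its gradient flow,
% up to a choice of time scaling.
R(a, U) = \tfrac{1}{2} \, \mathbb{E}\big[ (y - f(x; a, U))^2 \big],
\qquad
\frac{d}{dt}(a, U) = - \nabla R(a, U) .

% Generic form of a singularly perturbed (slow-fast) system
% with a small parameter \varepsilon > 0.
\dot{z}_{\mathrm{slow}} = F(z_{\mathrm{slow}}, z_{\mathrm{fast}}),
\qquad
\varepsilon \, \dot{z}_{\mathrm{fast}} = G(z_{\mathrm{slow}}, z_{\mathrm{fast}}) .
```

In a slow-fast system of this generic form, the fast variables relax quickly toward the set where $G = 0$ while the slow variables drift, which is the usual mechanism behind long plateaus punctuated by rapid transitions. Which quantities play the slow and fast roles here is not specified in the abstract.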