Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the rate of decrease of the empirical risk is non-monotone, even after averaging over large batches. Long plateaus with barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different timescales. Finally, models learnt in an early phase are typically "simpler" or "easier to learn", although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.
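For orientation, the objects named in the abstract can be written in a generic textbook form. This is only a sketch under assumed notation: the abstract fixes no symbols, so the link function $\varphi$, direction $w_*$, noise $\xi$, activation $\sigma$, width $m$, the $1/m$ scaling, the squared loss, and the split into slow and fast variables $(u, v)$ are all illustrative assumptions, not the paper's definitions.

```latex
% Single-index model: the response depends on the covariates x in R^d
% only through the one-dimensional projection <w_*, x> (notation assumed).
\[
  y = \varphi(\langle w_*, x \rangle) + \xi .
\]
% Two-layer network of width m and its population risk
% (mean-field 1/m scaling and squared loss are assumptions).
\[
  f(x;\theta) = \frac{1}{m} \sum_{i=1}^{m} a_i \, \sigma(\langle w_i, x \rangle),
  \qquad
  R(\theta) = \frac{1}{2} \, \mathbb{E}\big[ (y - f(x;\theta))^2 \big].
\]
% Population gradient flow, up to a rescaling of time.
\[
  \dot{\theta}(t) = - \nabla_{\theta} R(\theta(t)).
\]
% Standard form of a singularly perturbed system:
% u evolves on the slow timescale, v on the fast one.
\[
  \dot{u} = F(u, v), \qquad \epsilon \, \dot{v} = G(u, v), \qquad 0 < \epsilon \ll 1 .
\]
```

In a system of the last form, $v$ relaxes quickly toward the set where $G(u, v) = 0$ while $u$ drifts slowly, which is the usual mechanism behind separated timescales. Which network variables play the roles of $u$ and $v$, and what the small parameter is, is not stated in the abstract.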