Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the rate of decrease of the empirical risk is non-monotone, even after averaging over large batches. Long plateaus, in which one observes barely any progress, alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically `simpler' or `easier to learn', although in a way that is difficult to formalize. Theoretical explanations of these phenomena have been put forward, but each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of time scales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.
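
For concreteness, the setting can be written down as follows. This is only a sketch in standard notation: the symbols $w_*$, $\varphi$, $\sigma$, $a_i$, $w_i$, $m$, $d$, and $\epsilon$ are assumptions introduced here for illustration, and the normalization and time scaling used in the paper itself may differ. A single-index model with covariates $x \in \mathbb{R}^d$, a two-layer network with $m$ neurons, and gradient flow on the population square loss read
\[
y = \varphi(\langle w_*, x \rangle) + \text{noise}, \qquad
f(x;\theta) = \frac{1}{m} \sum_{i=1}^{m} a_i \, \sigma(\langle w_i, x \rangle), \qquad
\frac{\mathrm{d}\theta}{\mathrm{d}t} = -\nabla_\theta R(\theta), \quad
R(\theta) = \frac{1}{2} \, \mathbb{E}\big[ (y - f(x;\theta))^2 \big],
\]
with $\theta = (a_i, w_i)_{i \le m}$. A singularly perturbed dynamical system, in the usual textbook sense, has the slow-fast form
\[
\frac{\mathrm{d}u}{\mathrm{d}t} = F(u, v), \qquad
\epsilon \, \frac{\mathrm{d}v}{\mathrm{d}t} = G(u, v), \qquad 0 < \epsilon \ll 1,
\]
where the fast variable $v$ evolves on a time scale of order $\epsilon$ while the slow variable $u$ evolves on a time scale of order one. A system of this form generically shows long stretches of slow drift punctuated by rapid transitions, which matches the plateaus and sudden decreases described above. Which quantities play the roles of $u$, $v$, and $\epsilon$ in the network dynamics is not specified in this abstract.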