The successful training of neural networks hinges on the use of first-order optimization methods, yet the theoretical characterization of these methods remains incomplete. This is especially true in settings with mild overparameterization. In this work, we study the gradient flow dynamics of two-layer ReLU networks from small initialization with orthogonal training data. We prove that the flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, revealing an incremental learning phenomenon in which a new neuron activates at each saddle. This analysis recovers the known result of Dana et al. (2025, arXiv:2502.16977) that the network interpolates the training data with high probability as soon as $m \gtrsim \log(n)$, where $m$ is the network width and $n$ is the number of training samples. The characterization of the incremental process also allows us to derive a novel implicit bias result: the squared $\ell_2$-norm of the learned interpolator scales as $\sqrt{n}$, which is within a constant factor of that of the minimal $\ell_2$-norm interpolator. More broadly, our work provides the first rigorous proof of an incremental learning process for ReLU networks, and suggests that mildly overparameterized networks can converge to interpolating solutions whose complexity is of the same order as that of the optimal interpolator.
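To make the setting concrete, the following is a minimal numerical sketch, not the analysis or the experimental code of this work. It approximates gradient flow by gradient descent with a small step size on a two-layer ReLU network $f(x) = \sum_{j=1}^m a_j \max(w_j^\top x, 0)$, trained from a small random initialization on orthonormal inputs $x_i = e_i$. The mean squared loss, the alternating $\pm 1$ labels, the step size, and the function name `train_two_layer_relu` are all assumptions chosen for illustration.

```python
import numpy as np


def train_two_layer_relu(n=4, m=32, init_scale=1e-3, lr=0.05, steps=8000, seed=0):
    """Gradient descent (a small-step proxy for gradient flow) on a two-layer
    ReLU network with orthonormal inputs. All defaults are illustrative."""
    rng = np.random.default_rng(seed)

    # Orthonormal training inputs: x_i = e_i, so the input dimension equals n.
    X = np.eye(n)
    # Hypothetical labels alternating between +1 and -1.
    y = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)

    # Small Gaussian initialization of both layers.
    W = init_scale * rng.standard_normal((m, n))  # hidden weights, one row per neuron
    a = init_scale * rng.standard_normal(m)       # output weights

    losses = np.empty(steps)
    for t in range(steps):
        pre = X @ W.T                 # pre-activations, shape (n, m)
        h = np.maximum(pre, 0.0)      # ReLU features
        r = h @ a - y                 # residuals, shape (n,)
        losses[t] = 0.5 * np.mean(r ** 2)

        # Gradients of the loss (1 / 2n) * sum_i r_i^2.
        grad_a = h.T @ r / n
        grad_W = (r[:, None] * (pre > 0) * a[None, :]).T @ X / n

        a -= lr * grad_a
        W -= lr * grad_W

    return losses, W, a


if __name__ == "__main__":
    losses, W, a = train_two_layer_relu()
    # Print the loss at a few checkpoints to inspect the shape of the curve.
    for t in (0, 100, 500, 1000, 2000, 4000, len(losses) - 1):
        print(t, losses[t])
```

Plotting `losses` on a logarithmic time axis for decreasing values of `init_scale` is one way to look for the long plateaus separated by abrupt drops that a saddle-to-saddle description would suggest. A finite step size and a finite initialization scale only approximate the limiting regime studied here.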