How Two-Layer Neural Networks Learn, One (Giant) Step at a Time

We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large-batch gradient descent steps, improving the approximation capacity relative to the initialization. We compare the influence of batch size with that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = \mathcal{O}(d^2)$ is essential for neurons to specialize to multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist ``hard'' directions requiring $n = \mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. The picture drastically improves over multiple gradient steps: we show that a batch size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how learning these directions drastically improves the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence, which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical depiction in which learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on how neural networks adapt to features of the data.
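The single-step claim can be illustrated numerically. The sketch below is a toy experiment, not the paper's exact setting: it takes one large gradient step on the first-layer weights with a batch of size proportional to $d$ and measures how much the neurons align with a single target direction. The target $y = \tanh(\langle w^\star, x \rangle)$, the ReLU activation, the random-sign second layer, the ratio $n = 8d$, and the learning rate `lr_scale * p` are all assumptions chosen for illustration, and the function name `one_giant_step` is hypothetical.

```python
import numpy as np


def one_giant_step(d=200, p=50, n_over_d=8, lr_scale=10.0, seed=0):
    """Take one large-batch gradient step and report alignment with w_star.

    All modelling choices here (tanh target, ReLU network, step size) are
    illustrative assumptions, not the setting analysed in the paper.
    """
    rng = np.random.default_rng(seed)
    n = n_over_d * d  # batch size proportional to the input dimension

    # Single-index target: y depends on x only through <w_star, x>.
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    X = rng.standard_normal((n, d))
    y = np.tanh(X @ w_star)

    # Two-layer network f(x) = (1/p) * sum_j a_j * relu(<w_j, x>).
    W = rng.standard_normal((p, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    a = rng.choice([-1.0, 1.0], size=p)

    pre = X @ W.T                          # pre-activations, shape (n, p)
    f = np.maximum(pre, 0.0) @ a / p       # network outputs, shape (n,)
    resid = y - f

    # Gradient of the loss (1 / 2n) * sum_i (y_i - f(x_i))^2 w.r.t. W.
    # Row j is -(a_j / p) * (1/n) * sum_i resid_i * 1[pre_ij > 0] * x_i.
    G = -((resid[:, None] * (pre > 0)).T @ X) * (a[:, None] / (p * n))

    # One "giant" step: learning rate scaled with the width p (assumption).
    W_new = W - lr_scale * p * G

    def mean_overlap(M):
        # Average absolute cosine similarity between the rows of M and w_star.
        return float(np.mean(np.abs(M @ w_star) / np.linalg.norm(M, axis=1)))

    return mean_overlap(W), mean_overlap(W_new)


before, after = one_giant_step()
print(f"mean |cos(w_j, w_star)| before: {before:.3f}, after: {after:.3f}")
```

At initialization the overlap is of order $1/\sqrt{d}$, so an overlap of order one after the step is the signature of feature learning in this toy setup. Because the target here depends on a single direction, the sketch says nothing about specialization to multiple directions or about targets with a higher leap index.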
