Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Although this phenomenon is widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, encompassing fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing the fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, disentangling previously confounded factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.
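The linear-network case described above can be illustrated numerically. The following is a minimal sketch, not the authors' code: it trains a two-layer linear network from small random initialization on a low-rank teacher map and tracks the effective rank of the learned product. Under these assumptions, the loss typically decreases in plateaus while the effective rank increases one step at a time, consistent with the saddle-to-saddle picture. All variable names, thresholds, and hyperparameters here are illustrative choices.

```python
# Sketch: saddle-to-saddle dynamics in a two-layer linear network (illustrative,
# not the paper's experimental setup). Small initialization places the weights
# near the saddle at the origin; gradient descent then escapes saddles one at a
# time, and the effective rank of W2 @ W1 grows incrementally.
import numpy as np

rng = np.random.default_rng(0)
d, hidden, n = 10, 10, 500          # input/output dim, hidden width, samples

# Low-rank "teacher" with well-separated singular values (5, 3, 1, 0, ...).
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
S = np.diag([5.0, 3.0, 1.0] + [0.0] * (d - 3))
W_star = U @ S @ V.T

X = rng.normal(size=(n, d))
Y = X @ W_star.T

# Small initialization: dynamics start near the saddle at the origin.
scale = 1e-3
W1 = scale * rng.normal(size=(hidden, d))
W2 = scale * rng.normal(size=(d, hidden))

lr, steps = 0.01, 5000
for t in range(steps):
    err = X @ (W2 @ W1).T - Y                   # residual, shape (n, d)
    grad_prod = (err.T @ X) / n                 # gradient w.r.t. the product W2 @ W1
    gW2 = grad_prod @ W1.T                      # chain rule through the factorization
    gW1 = W2.T @ grad_prod
    W2 -= lr * gW2
    W1 -= lr * gW1
    if t % 250 == 0:
        loss = 0.5 * np.mean(err ** 2)
        svals = np.linalg.svd(W2 @ W1, compute_uv=False)
        eff_rank = int(np.sum(svals > 0.3))     # illustrative rank threshold
        print(f"step {t:5d}  loss {loss:.5f}  effective rank {eff_rank}")
```

Printing the loss and effective rank at regular intervals makes the stepwise structure visible: each plateau corresponds to the dynamics lingering near a saddle of a given rank before a new singular direction of the product activates.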