The ability of deep neural networks to generalise well even when they interpolate their training data has been explained using various "simplicity biases". These theories postulate that neural networks avoid overfitting by first learning simple functions, say a linear classifier, before learning more complex, non-linear functions. Meanwhile, data structure is also recognised as a key ingredient for good generalisation, but its role in simplicity biases is not yet understood. Here, we show that neural networks trained using stochastic gradient descent initially classify their inputs using lower-order input statistics, like mean and covariance, and exploit higher-order statistics only later during training. We first demonstrate this distributional simplicity bias (DSB) in a solvable model of a neural network trained on synthetic data. We then demonstrate DSB empirically in a range of deep convolutional networks and visual transformers trained on CIFAR10, and show that it even holds in networks pre-trained on ImageNet. Finally, we discuss the relation of DSB to other simplicity biases and consider its implications for the principle of Gaussian universality in learning.
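To make "lower-order input statistics" concrete, the following is a minimal sketch, not taken from the text above, of one way to build a surrogate dataset that keeps only the per-class mean and covariance of the inputs and discards all higher-order structure. The function name `gaussian_clone` and the toy data are hypothetical and chosen purely for illustration.

```python
import numpy as np


def gaussian_clone(x, y, rng):
    """Return inputs resampled from per-class Gaussians.

    Hypothetical helper: for every class label in `y`, the samples of
    that class are replaced by draws from a Gaussian with the same
    empirical mean and covariance. Higher-order statistics are lost.
    """
    x = np.asarray(x, dtype=float)
    x_clone = np.empty_like(x)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        mean = x[idx].mean(axis=0)            # first-order statistics
        cov = np.cov(x[idx], rowvar=False)    # second-order statistics
        x_clone[idx] = rng.multivariate_normal(mean, cov, size=idx.size)
    return x_clone


def skewness(a):
    """Per-feature sample skewness (a third-order statistic)."""
    centred = a - a.mean(axis=0)
    return (centred ** 3).mean(axis=0) / a.std(axis=0) ** 3


# Toy data (an assumption for illustration): two classes of strongly
# skewed, non-Gaussian inputs in three dimensions.
rng = np.random.default_rng(0)
n, d = 5000, 3
x = np.concatenate([rng.exponential(1.0, size=(n, d)),
                    3.0 - rng.exponential(1.0, size=(n, d))])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

x_clone = gaussian_clone(x, y, rng)
print("skewness of class 0, original:", skewness(x[y == 0]))
print("skewness of class 0, clone:   ", skewness(x_clone[y == 0]))
```

Under these assumptions, a network that relies only on means and covariances should behave similarly whether it is trained on `x` or on `x_clone`, whereas a network that exploits higher-order statistics should not. Comparing the two over the course of training is one possible way to look for the bias described above.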