Modern deep neural networks are highly over-parameterized relative to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutions with lower-effective-rank embeddings. We conjecture that this bias exists because the volume of functions that map to low-effective-rank embeddings increases with depth. We show empirically that our claim holds for finite-width linear and non-linear models under practical learning paradigms, and that on natural data these low-rank solutions are often the ones that generalize well. We then show that this simplicity bias exists both at initialization and after training, and that it is resilient to hyper-parameters and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce a low-rank bias, improving generalization performance on CIFAR and ImageNet without changing the modeling capacity.
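The conjecture that depth favors low-effective-rank maps at initialization can be probed with a small numerical sketch. The code below is illustrative only and rests on assumptions not stated in the abstract: it uses the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), square Gaussian weight matrices scaled by 1/sqrt(dim), and purely linear layers. The helper names `effective_rank`, `random_deep_linear_map`, and `mean_effective_rank` are hypothetical.

```python
import numpy as np


def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank (an assumed definition):
    exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values to a distribution
    p = p[p > eps]            # drop numerically zero entries before the log
    return float(np.exp(-(p * np.log(p)).sum()))


def random_deep_linear_map(dim, depth, rng):
    """End-to-end matrix of a depth-`depth` linear network at random
    initialization: a product of i.i.d. Gaussian weight matrices."""
    product = np.eye(dim)
    for _ in range(depth):
        layer = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        product = layer @ product
    return product


def mean_effective_rank(dim, depth, trials=20, seed=0):
    """Average effective rank over several random initializations."""
    rng = np.random.default_rng(seed)
    ranks = [
        effective_rank(random_deep_linear_map(dim, depth, rng))
        for _ in range(trials)
    ]
    return float(np.mean(ranks))


if __name__ == "__main__":
    # Print the average effective rank of the end-to-end map per depth.
    for depth in (1, 2, 4, 8, 16):
        print(depth, round(mean_effective_rank(32, depth), 2))
```

Under these assumptions, the printed averages can be compared across depths to see whether deeper random products concentrate their spectrum on fewer directions. The same product structure is what a linear over-parameterization would rely on: a stack of linear layers can be multiplied into a single matrix after training, so the modeling capacity at inference is unchanged.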