Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so that they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow "teacher NN" that agrees with the labels. Specifically, we show that such a "flat" prior over the NN parametrization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent. This bias enables learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's.
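To make the sampling procedure and the induced function prior concrete, the following is a minimal toy sketch, not the construction analyzed in the paper. Everything in it is an illustrative assumption: the ternary weight alphabet `WEIGHT_VALUES`, the tiny two-layer ReLU student, the single-neuron teacher `teacher`, and the helper names `sample_interpolator` and `function_prior`. It draws weights uniformly at random, keeps the first draw that perfectly classifies the training set (guess and check), and separately enumerates all parameter settings to count how many of them realize each function.

```python
import itertools
import random
from collections import Counter

# All sizes and the weight alphabet below are assumptions chosen for illustration.
WEIGHT_VALUES = (-1, 0, 1)                   # quantized weights, uniform ("flat") prior
INPUT_DIM = 3
HIDDEN_WIDTH = 2                             # student is wider than the teacher
NUM_PARAMS = HIDDEN_WIDTH * (INPUT_DIM + 1)  # per neuron: input weights + output weight

ALL_INPUTS = list(itertools.product((-1, 1), repeat=INPUT_DIM))


def predict(params, x):
    """Two-layer ReLU network with a sign output; params is a flat tuple."""
    out = 0
    for j in range(HIDDEN_WIDTH):
        base = j * (INPUT_DIM + 1)
        w = params[base:base + INPUT_DIM]
        v = params[base + INPUT_DIM]
        pre = sum(wi * xi for wi, xi in zip(w, x))
        out += v * max(pre, 0)
    return 1 if out > 0 else -1


def teacher(x):
    """Narrow teacher: realizable with one hidden neuron, w = (1, 0, 0), v = 1."""
    return 1 if x[0] > 0 else -1


def sample_interpolator(train_x, train_y, rng, max_tries=100_000):
    """Guess and check: sample uniform weights until the training set is fit."""
    for _ in range(max_tries):
        params = tuple(rng.choice(WEIGHT_VALUES) for _ in range(NUM_PARAMS))
        if all(predict(params, x) == y for x, y in zip(train_x, train_y)):
            return params
    return None  # no interpolator found within the budget


def function_prior():
    """Count how many parameter settings map to each function on ALL_INPUTS."""
    counts = Counter()
    for params in itertools.product(WEIGHT_VALUES, repeat=NUM_PARAMS):
        counts[tuple(predict(params, x) for x in ALL_INPUTS)] += 1
    return counts


rng = random.Random(0)
train_x = rng.sample(ALL_INPUTS, 4)          # half of the input space as training data
train_y = [teacher(x) for x in train_x]

student = sample_interpolator(train_x, train_y, rng)
prior = function_prior()
teacher_fn = tuple(teacher(x) for x in ALL_INPUTS)

agreement = sum(predict(student, x) == teacher(x) for x in ALL_INPUTS)
print("student agrees with teacher on", agreement, "of", len(ALL_INPUTS), "inputs")
print("distinct functions:", len(prior), "of", 3 ** NUM_PARAMS, "parameter settings")
print("parameter settings realizing the teacher:", prior[teacher_fn])
```

In this toy setting, many parameter settings collapse onto the same function (for example, any hidden neuron with a zero output weight is irrelevant), so a uniform prior over parameters is not uniform over functions. The sketch only illustrates that redundancy; it does not reproduce the paper's sample-complexity result.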