Neural networks often operate in the overparameterized regime, in which there are far more parameters than training samples, allowing the training data to be fit perfectly. That is, training the network effectively learns an interpolating function, and properties of the interpolant affect the predictions the network will make on new samples. This manuscript explores the properties of such functions learned by neural networks of depth greater than two layers. Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding linear layers to the input side of a shallow ReLU network yields a representation cost favoring functions with low mixed variation, that is, functions that have limited variation in directions orthogonal to a low-dimensional subspace and can be well approximated by a single- or multi-index model. Such functions may be represented by the composition of a function with low two-layer representation cost and a low-rank linear operator. Our experiments confirm this behavior in standard network training regimes. They additionally show that linear layers can improve generalization and that the learned network is well aligned with the true latent low-dimensional linear subspace when data is generated using a multi-index model.
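To make the central objects concrete, the following is a minimal NumPy sketch, not the manuscript's implementation. It builds a function of the form described above, a shallow ReLU network composed with a low-rank linear operator, and evaluates the sum of squared weights for that one parameterization. The names `forward`, `sum_squared_weights`, `V`, `W_hidden`, `b_hidden`, and `a_out`, the dimensions, and the choice to leave biases out of the sum are all illustrative assumptions. Note that the representation cost is the minimum of this quantity over all parameterizations of the same function, so the value computed here is only an upper bound on it.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def forward(x, linear_weights, W_hidden, b_hidden, a_out):
    """Apply linear layers on the input side, then a shallow ReLU network."""
    h = x
    for W in linear_weights:
        h = W @ h
    return float(a_out @ relu(W_hidden @ h + b_hidden))


def sum_squared_weights(linear_weights, W_hidden, a_out):
    """Sum of squared weights of ONE parameterization.

    Assumption: biases are excluded. The representation cost would be the
    minimum of this value over all weights representing the same function.
    """
    mats = list(linear_weights) + [W_hidden, a_out]
    return sum(float(np.sum(M ** 2)) for M in mats)


rng = np.random.default_rng(0)
d, r, width = 6, 2, 8                        # hypothetical sizes, with r < d

V = rng.standard_normal((r, d))              # linear operator of rank at most r
W_hidden = rng.standard_normal((width, r))   # hidden-layer weights
b_hidden = rng.standard_normal(width)        # hidden-layer biases
a_out = rng.standard_normal(width)           # output weights

x = rng.standard_normal(d)

# Build a direction orthogonal to the row space of V by removing the
# row-space component of a random vector.
g = rng.standard_normal(d)
u = g - np.linalg.pinv(V) @ (V @ g)

f_x = forward(x, [V], W_hidden, b_hidden, a_out)
f_shifted = forward(x + 3.0 * u, [V], W_hidden, b_hidden, a_out)
cost = sum_squared_weights([V], W_hidden, a_out)

print("f(x)            =", f_x)
print("f(x + 3u)       =", f_shifted)
print("sum of squares  =", cost)
```

Because `u` lies in the null space of `V`, moving `x` along `u` leaves the output unchanged up to floating-point error. This is the extreme case of limited variation in directions orthogonal to a low-dimensional subspace, here an exact multi-index model with `r` indices.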