We study the type of solutions to which stochastic gradient descent converges when used to train a single-hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univariate case, it was shown that linearly stable minima correspond to network functions (predictors) whose second derivative has a bounded weighted $L^1$ norm. Notably, the bound gets smaller as the step size increases, implying that training with a large step size leads to `smoother' predictors. Here we generalize this result to the multivariate case, showing that a similar bound applies to the Laplacian of the predictor. We demonstrate the tightness of our bound on the MNIST dataset, and show that it accurately captures the behavior of the solutions as a function of the step size. Additionally, we prove a depth separation result on the approximation power of ReLU networks corresponding to stable minima of the loss. Specifically, although shallow ReLU networks are universal approximators, we prove that stable shallow networks are not. Namely, there is a function that cannot be well-approximated by stable single-hidden-layer ReLU networks trained with a non-vanishing step size, whereas the same function can be realized as a stable two-hidden-layer ReLU network. Finally, we prove that if a function is sufficiently smooth (in a Sobolev sense) then it can be approximated arbitrarily well using shallow ReLU networks that correspond to stable solutions of gradient descent.
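To make the role of the step size concrete, the following is a minimal sketch of linear stability in the simplest setting only: full-batch gradient descent on a quadratic loss with a diagonal Hessian. It is not the analysis of this work, which concerns SGD on ReLU networks. The names `gd_is_linearly_stable` and `run_gd`, and the curvature values, are hypothetical and chosen for illustration. Along a direction with curvature $\lambda$, one step with step size $\eta$ multiplies the deviation from the minimum by $(1 - \eta\lambda)$, so the minimum is stable only if $\lambda_{\max} \le 2/\eta$; a larger step size therefore tolerates only flatter minima.

```python
# Illustrative assumption: loss f(w) = 0.5 * sum_i lam_i * w_i**2, minimum at w = 0.
# The Hessian is diagonal, so its eigenvalues are the curvatures lam_i.

def gd_is_linearly_stable(curvatures, step_size):
    """Check the textbook GD stability condition: max curvature <= 2 / step size."""
    return max(curvatures) <= 2.0 / step_size

def run_gd(curvatures, w0, step_size, n_steps=200):
    """Run gradient descent and return the largest coordinate magnitude at the end."""
    w = list(w0)
    for _ in range(n_steps):
        # The gradient of the quadratic along coordinate i is lam_i * w_i.
        w = [wi - step_size * lam * wi for wi, lam in zip(w, curvatures)]
    return max(abs(wi) for wi in w)

curvatures = [1.0, 4.0]   # largest curvature 4, so the threshold step size is 0.5
w0 = [1.0, 1.0]           # small perturbation away from the minimum

for eta in (0.4, 0.6):
    stable = gd_is_linearly_stable(curvatures, eta)
    final = run_gd(curvatures, w0, eta)
    print(f"step size {eta}: stable={stable}, final deviation={final:.3e}")
```

With the step size below the threshold the perturbation decays, and above it the perturbation grows along the sharpest direction, which is the mechanism behind the statement that non-vanishing step sizes restrict which minima can be reached.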