We study the generalization of two-layer ReLU neural networks in a univariate nonparametric regression problem with noisy labels. This is a problem where kernels (\emph{e.g.}, the NTK) are provably sub-optimal and benign overfitting does not happen, thus disqualifying existing theory for interpolating (zero-loss, globally optimal) solutions. We present a new theory of generalization for local minima that gradient descent with a constant learning rate can \emph{stably} converge to. We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions with a certain weighted \emph{first-order total variation} bounded by $1/\eta - 1/2 + \widetilde{O}(\sigma + \sqrt{\mathrm{MSE}})$, where $\sigma$ is the label noise level, $\mathrm{MSE}$ is short for mean squared error against the ground truth, and $\widetilde{O}(\cdot)$ hides a logarithmic factor. Under mild assumptions, we also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points. Our theoretical results are validated by extensive simulations demonstrating that training with a large learning rate induces sparse linear-spline fits. To the best of our knowledge, we are the first to obtain a generalization bound via minima stability in the non-interpolation case, and the first to show that ReLU NNs without regularization can achieve near-optimal rates in nonparametric regression.
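To make the quantity in the bound concrete (a sketch in our notation; the data-dependent weighting used in the formal statement is omitted here), consider a width-$m$ two-layer ReLU network
\[
f_\theta(x) \;=\; c + \sum_{j=1}^{m} a_j\,(w_j x + b_j)_+ .
\]
Its derivative is piecewise constant with knots at $\tau_j = -b_j/w_j$, and its first-order total variation over an interval $I$ satisfies
\[
\mathrm{TV}\big(f_\theta';\, I\big) \;\le\; \sum_{j:\,\tau_j \in I} |a_j w_j|,
\]
with equality when the knots in $I$ are distinct. The bound above controls a weighted version of this sum by $1/\eta - 1/2 + \widetilde{O}(\sigma + \sqrt{\mathrm{MSE}})$ at any local minimum to which gradient descent with step size $\eta$ stably converges.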