We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs, a problem well motivated by the minima-stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. In two natural settings, (1) the generalization gap of flat solutions and (2) the mean-squared error (MSE) of nonparametric function estimation by stable minima, we prove upper and lower bounds establishing that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This yields an exponential separation between flat solutions and low-norm solutions (i.e., those favored by weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of "neural shattering" in which neurons activate rarely but carry large weight magnitudes, leading to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.
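The "neural shattering" mechanism can be made concrete with a small Monte Carlo sketch. The snippet below is a toy illustration under assumed settings (inputs uniform on the cube [-1, 1]^d and a curvature proxy equal to activation frequency times squared output weight), not the paper's actual lower-bound construction: a single ReLU neuron whose active half-space is pushed toward the boundary of the cube activates exponentially rarely as the dimension d grows, so its output weight can grow large while the curvature proxy stays bounded.

```python
import numpy as np

# Toy illustration (an assumption-laden sketch, not the paper's construction):
# a boundary-localized ReLU neuron ReLU(<w, x> + b) with inputs uniform on
# [-1, 1]^d.  The bias pushes the active half-space toward the cube boundary,
# so the activation frequency p shrinks rapidly with d, and keeping the crude
# flatness proxy p * a^2 bounded by 1 permits an output weight |a| = 1/sqrt(p).
rng = np.random.default_rng(0)
n = 500_000
freqs, weights = [], []
for d in [2, 4, 8, 16]:
    x = rng.uniform(-1.0, 1.0, size=(n, d))  # inputs uniform on [-1, 1]^d
    w = np.ones(d) / np.sqrt(d)              # unit-norm inner weight vector
    b = -0.5 * np.sqrt(d)                    # bias localizes the active region
    p_hat = np.mean(x @ w + b > 0)           # empirical activation frequency
    a_max = 1.0 / np.sqrt(p_hat)             # weight allowed if p * a^2 <= 1
    freqs.append(p_hat)
    weights.append(a_max)
    print(f"d={d:2d}  P(active) ~ {p_hat:.2e}   allowed |a| ~ {a_max:.1f}")
```

Running this shows the activation frequency collapsing (roughly 0.12 at d = 2 down to the order of 1e-4 at d = 16) while the permissible weight magnitude grows in lockstep, which is exactly the rarely-active, high-magnitude regime the lower bound exploits.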