We study the implicit bias of flatness / low (loss) curvature and its effect on generalization in two-layer overparameterized ReLU networks with multivariate inputs -- a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. We prove upper and lower bounds in two natural settings: (1) the generalization gap of flat solutions, and (2) the mean-squared error (MSE) of nonparametric function estimation by stable minima. These bounds establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This yields an exponential separation between flat solutions and low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of ``neural shattering'' in which neurons activate rarely but carry large weight magnitudes, leading to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.
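The following is a minimal numerical sketch of the qualitative mechanism behind ``neural shattering'', not the paper's lower bound construction or proof: ReLU neurons whose activation thresholds sit at the boundary of the training data never fire on the training set, so they leave the empirical loss and its curvature unchanged, yet their large output weights inflate the test error, increasingly so as the input dimension grows. The target function, the constant `C`, and the shortcut of using the ground truth as a stand-in for a well-fit flat base model are all illustrative assumptions, not choices taken from the paper.

```python
# Sketch (illustrative assumptions throughout): boundary-localized ReLU neurons
# that never activate on the finite training set leave the empirical loss and its
# curvature untouched, but their large output weights degrade test error, and the
# fraction of affected test points grows with the input dimension d.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 200, 100_000
C = 5.0  # peak magnitude of each spiky neuron's contribution (illustrative)

def target(X):
    # Smooth ground-truth function of the first coordinate only.
    return np.sin(2 * np.pi * X[:, 0])

def relu(z):
    return np.maximum(z, 0.0)

for d in (5, 20, 80):
    X_tr = rng.uniform(size=(n_train, d))
    X_te = rng.uniform(size=(n_test, d))
    y_te = target(X_te)

    # Stand-in for a well-fit "flat" base model: the ground-truth function itself.
    # We only measure the *extra* effect of adding shattered neurons on top of it.
    base_te = target(X_te)

    # One boundary-localized neuron per coordinate: its threshold equals the largest
    # training value of that coordinate, so every training activation is exactly
    # zero, while the output weight C / (1 - t_j) is large whenever the gap is small.
    t = X_tr.max(axis=0)                 # shape (d,)
    a = C / (1.0 - t)                    # large output weights
    H_tr = relu(X_tr - t)                # (n_train, d) hidden activations, all zero
    H_te = relu(X_te - t)
    spike_te = H_te @ a                  # nonzero only near the data boundary

    # Curvature proxy: Gauss-Newton diagonal of the squared loss w.r.t. the output
    # weights, (2/n) * sum_i h_k(x_i)^2 -- identically zero here since H_tr == 0.
    curv = 2.0 * np.mean(H_tr ** 2, axis=0).sum()

    extra_test_mse = np.mean((base_te + spike_te - y_te) ** 2) - np.mean((base_te - y_te) ** 2)
    hit_frac = np.mean((X_te > t).any(axis=1))

    print(f"d={d:3d}: train activations={int((H_tr > 0).sum())}, "
          f"added curvature={curv:.1e}, test points hit={hit_frac:.1%}, "
          f"extra test MSE={extra_test_mse:.3f}")
```

Under these assumptions, the added neurons contribute zero training activations and zero added curvature for every d, while the fraction of test points falling in some boundary slab (roughly d/(n+1)) and the resulting extra test MSE both grow with d, illustrating why rarely-active, high-magnitude neurons are compatible with flatness yet harmful in high dimensions.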