Understanding generalization in overparameterized neural networks hinges on the interplay between data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias, presenting results for overparameterized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: when the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well; conversely, for data that is easily shattered (e.g., data supported on the sphere), gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings from the literature.