A common heuristic used to explain the generalization of first-order gradient methods on non-convex neural networks is that "flat interpolators generalize well" (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017), where flatness can be measured by the trace of the Hessian of the empirical loss. However, Dinh et al. (2017) showed that any interpolator can be made sharper or flatter by exploiting symmetries of the network, which change flatness while keeping the population and empirical losses unchanged. This result makes the earlier heuristic statement vacuous. In this paper, we show that for learning an unknown multi-index model with two-layer non-convex homogeneous neural networks, there is a connection between flatness and generalization, despite the existence of symmetries. This connection pertains to the "flattest" interpolators, i.e., the interpolators that have orderwise minimum flatness among all interpolators. First, we show that there exists a natural class of non-generalizing interpolators whose flatness cannot be made closer to the flattest possible, even using symmetries. Second, we show that for data generated by a sum of single-index models, if the approximation error and label noise are low, any flattest interpolator achieves small population loss, i.e., the flattest interpolators always generalize. This establishes a direct link between flatness and generalization that applies to a large class of activations and realistic data distributions.
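The rescaling symmetry behind the Dinh et al. (2017) observation can be illustrated numerically. The following is a minimal sketch and not the paper's construction. It assumes a bias-free two-layer ReLU network $f(x) = \sum_j a_j \,\mathrm{relu}(w_j^\top x)$, the squared loss $\frac{1}{2n}\sum_i (f(x_i) - y_i)^2$, and random Gaussian inputs; the function names `predict`, `hessian_trace`, and `rescale` are hypothetical. Because ReLU is piecewise linear, the diagonal second derivatives of $f$ vanish almost everywhere, so the Hessian trace of this loss reduces to the mean squared norm of the parameter gradient of $f$ and does not depend on the labels.

```python
import numpy as np

def predict(a, W, X):
    """Two-layer ReLU network without biases: f(x) = sum_j a_j * relu(w_j . x)."""
    return np.maximum(X @ W.T, 0.0) @ a

def hessian_trace(a, W, X):
    """Trace of the Hessian of the empirical squared loss (1/2n) sum_i (f(x_i) - y_i)^2.

    Assumption: ReLU activation, so the diagonal second derivatives of f are
    zero almost everywhere and the trace equals mean_i ||grad_theta f(x_i)||^2.
    """
    pre = X @ W.T                                      # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                         # df/da_j = relu(w_j . x)
    gate = (pre > 0).astype(float)                     # ReLU derivative
    sq_norm_x = np.sum(X ** 2, axis=1, keepdims=True)  # (n, 1), ||x_i||^2
    grad_a_sq = act ** 2                               # (df/da_j)^2
    grad_W_sq = (a ** 2) * gate * sq_norm_x            # ||df/dw_j||^2
    return float(np.mean(np.sum(grad_a_sq + grad_W_sq, axis=1)))

def rescale(a, W, c):
    """Symmetry of homogeneous networks: (a_j, w_j) -> (a_j / c, c * w_j), c > 0."""
    return a / c, W * c

rng = np.random.default_rng(0)
n, d, m = 32, 5, 8                  # samples, input dimension, hidden width
X = rng.normal(size=(n, d))
a = rng.normal(size=m)
W = rng.normal(size=(m, d))

a_sharp, W_sharp = rescale(a, W, 10.0)

# Same function on the data (hence same empirical loss), different flatness.
print("max prediction gap:", np.max(np.abs(predict(a, W, X) - predict(a_sharp, W_sharp, X))))
print("Hessian trace before rescaling:", hessian_trace(a, W, X))
print("Hessian trace after rescaling: ", hessian_trace(a_sharp, W_sharp, X))
```

In this sketch the rescaled parameters define the same function, so every loss is unchanged, while the Hessian trace differs. This is why the abstract compares interpolators through the flattest value attainable rather than through the flatness of one particular parameterization.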