Flatness of the loss curve around a given model has been shown empirically to correlate with its generalization ability. Optimizing for flatness was proposed as early as 1994 by Hochreiter and Schmidhuber, and was followed by more recent, successful sharpness-aware optimization techniques. Their widespread adoption in practice is questionable, though, because a theoretically grounded connection between flatness and generalization is lacking, in particular in light of the reparameterization curse: certain reparameterizations of a neural network change most flatness measures but do not change generalization. Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization and solves the reparameterization curse. In this paper, we derive a regularizer based on this relative flatness that is easy and efficient to compute and works with arbitrary loss functions. It requires computing the Hessian of only a single layer of the network, which makes it applicable to large neural networks and avoids an expensive mapping of the loss surface in the vicinity of the model. In an extensive empirical evaluation, we show that this relative flatness aware minimization (FAM) improves generalization in a multitude of applications and models, both in finetuning and in standard training. We make the code available on GitHub.
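To make the reparameterization curse concrete, the following minimal sketch rescales the two layers of a toy ReLU network so that the network function stays the same while the Hessian trace of the first layer changes. It also shows that weighting this trace by the squared norm of the layer's weights removes the dependence on the rescaling. This norm-weighted trace is a simplified stand-in chosen for illustration only; it is not the relative flatness measure or the FAM regularizer derived in the paper. The toy network, the data, and all function names are hypothetical.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def loss(w, v, data):
    """Mean squared error of the toy network f(x) = sum_j v[j] * relu(w[j] * x)."""
    total = 0.0
    for x, y in data:
        pred = sum(vj * relu(wj * x) for wj, vj in zip(w, v))
        total += (pred - y) ** 2
    return total / len(data)

def hessian_trace_wrt_w(w, v, data, h=1e-2):
    """Trace of the Hessian w.r.t. the first layer only (central second differences)."""
    base = loss(w, v, data)
    trace = 0.0
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        trace += (loss(up, v, data) - 2.0 * base + loss(down, v, data)) / h**2
    return trace

def norm_weighted_trace(w, v, data):
    """Assumed simplified measure: ||w||^2 * Tr(H_w). Illustrative only."""
    return sum(wj * wj for wj in w) * hessian_trace_wrt_w(w, v, data)

def rescale(w, v, a):
    """Layer-wise rescaling (a > 0) that leaves the network function unchanged."""
    return [a * wj for wj in w], [vj / a for vj in v]

# Hypothetical toy data and weights.
data = [(0.5, 1.0), (1.0, 0.3), (2.0, -0.7)]
w, v = [0.8, -1.2], [0.5, 1.5]
w_scaled, v_scaled = rescale(w, v, 10.0)

print("loss:               ", loss(w, v, data), loss(w_scaled, v_scaled, data))
print("Hessian trace:      ", hessian_trace_wrt_w(w, v, data),
      hessian_trace_wrt_w(w_scaled, v_scaled, data))
print("norm-weighted trace:", norm_weighted_trace(w, v, data),
      norm_weighted_trace(w_scaled, v_scaled, data))
```

In this toy setting only the second derivatives of one layer are needed, which mirrors the single-layer Hessian requirement mentioned above. A practical implementation would obtain them with automatic differentiation rather than finite differences.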