Modern deep neural networks (DNNs) have achieved state-of-the-art performance but are typically over-parameterized. Without additional customized training strategies, this over-parameterization may result in an undesirably large generalization error. Recently, a line of research under the name of Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error. However, SAM-like methods require roughly twice the computation of the given base optimizer (e.g., SGD), because approximating the sharpness measure costs extra computation at every update. In this paper, we propose Sharpness-Aware Training for Free, or SAF, which mitigates the sharp landscape at almost zero additional computational cost over the base optimizer. Intuitively, SAF achieves this by avoiding sudden drops in the loss in sharp local minima along the trajectory of the weight updates. Specifically, we suggest a novel trajectory loss, based on the KL-divergence between the outputs of the DNN with the current weights and with past weights, as a replacement for SAM's sharpness measure. This loss captures the rate of change of the training loss along the model's update trajectory. By minimizing it, SAF ensures convergence to a flat minimum with improved generalization capabilities. Extensive empirical results show that SAF minimizes the sharpness in the same way that SAM does, yielding better results on the ImageNet dataset with essentially the same computational cost as the base optimizer.
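To make the trajectory loss concrete, the following is a minimal sketch of a KL-divergence penalty between the network's outputs under past weights and under the current weights. The abstract does not give the exact formulation, so the direction of the KL term, the softmax `temperature`, the mixing coefficient `weight`, and all function names are illustrative assumptions. The sketch also assumes that the past outputs were cached during an earlier pass over the same sample, which is one way such a term could avoid an extra forward or backward pass.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def trajectory_loss(past_logits, current_logits, temperature=1.0):
    """Penalty that grows when the current outputs drift from cached past outputs.

    Assumption: past_logits were stored for this sample at an earlier point in
    training, so computing this term needs no additional forward pass.
    """
    p_past = softmax(past_logits, temperature)
    p_now = softmax(current_logits, temperature)
    return kl_divergence(p_past, p_now)


def total_loss(task_loss, past_logits, current_logits, weight=0.3, temperature=1.0):
    """Task loss plus the weighted trajectory penalty (weight is hypothetical)."""
    return task_loss + weight * trajectory_loss(past_logits, current_logits, temperature)
```

In this sketch the penalty vanishes when the outputs are unchanged and increases as the predictions move away from their cached values, which discourages abrupt changes along the update trajectory. A real training loop would compute the same quantity on batched tensors inside an autodiff framework, with the cached outputs treated as constants.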