Deep neural networks often generalize poorly because their loss landscapes are complex and non-convex. Sharpness-Aware Minimization (SAM) is a popular remedy that smooths the loss landscape by minimizing the worst-case change in training loss under a perturbation of the weights. However, SAM perturbs all parameters indiscriminately, which is suboptimal and computationally expensive, doubling the overhead of common optimizers such as Stochastic Gradient Descent (SGD). In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation via a binary mask. To obtain the sparse mask, we provide two solutions, based on Fisher information and on dynamic sparse training, respectively. We investigate the impact of different masks, including unstructured, structured, and $N$:$M$ structured patterns, as well as explicit and implicit implementations of sparse perturbation. We theoretically prove that SSAM can converge at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$. SSAM has the potential to accelerate training and to smooth the loss landscape effectively. Extensive experiments on CIFAR and ImageNet-1K confirm that our method is more efficient than SAM, and that performance is preserved or even improved with a perturbation sparsity of only 50\%. Code is available at https://github.com/Mi-Peng/Systematic-Investigation-of-Sparse-Perturbed-Sharpness-Aware-Minimization-Optimizer.
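To make the idea of a masked perturbation concrete, the following is a minimal sketch of one sparse-perturbed SAM step on a toy quadratic loss. It is not the authors' implementation (see the repository linked above for that). The function names (`grad`, `topk_mask`, `ssam_step`), the toy loss, and the hyperparameter values are all hypothetical. The sketch also assumes that the squared gradient can serve as a crude stand-in for a diagonal Fisher score and that the mask is recomputed at every step; the paper's actual mask construction and update schedule may differ.

```python
import math

def grad(w, curv):
    """Gradient of the toy loss 0.5 * sum(c_i * w_i**2) (illustrative only)."""
    return [c * wi for c, wi in zip(curv, w)]

def topk_mask(scores, sparsity):
    """Binary mask that keeps the highest-scoring (1 - sparsity) fraction of entries."""
    n_keep = max(1, round(len(scores) * (1.0 - sparsity)))
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    mask = [0] * len(scores)
    for i in keep:
        mask[i] = 1
    return mask

def ssam_step(w, curv, lr=0.1, rho=0.05, sparsity=0.5):
    """One sparse-perturbed SAM step; returns the updated weights and the mask used."""
    g = grad(w, curv)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Assumption: squared gradient as a proxy for a diagonal Fisher score.
    mask = topk_mask([gi * gi for gi in g], sparsity)
    # SAM-style ascent direction, applied only where the mask is 1.
    eps = [rho * mi * gi / norm for mi, gi in zip(mask, g)]
    # Gradient evaluated at the perturbed weights drives the descent step.
    g_pert = grad([wi + ei for wi, ei in zip(w, eps)], curv)
    new_w = [wi - lr * gi for wi, gi in zip(w, g_pert)]
    return new_w, mask

w = [1.0, -2.0, 0.5, 3.0]
curv = [1.0, 4.0, 0.5, 2.0]
new_w, mask = ssam_step(w, curv)
print(mask)  # [0, 1, 0, 1]
```

With `sparsity=0.0` the mask is all ones and the step reduces to a plain SAM step on this toy problem. Coordinates whose mask entry is 0 receive an ordinary gradient update, since their perturbation is zero.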