Sharpness-Aware Minimization (SAM), a recently proposed optimization algorithm for deep neural networks, perturbs the parameters by a gradient ascent step before the gradient computation in order to guide the optimization into regions of the parameter space with flat loss. While SAM has been shown to yield significant generalization improvements and thus to reduce overfitting, it doubles the computational cost due to the additional gradient computation, making it infeasible when computational capacity is limited. Motivated by Nesterov Accelerated Gradient (NAG), we propose Momentum-SAM (MSAM), which perturbs the parameters in the direction of the accumulated momentum vector and thereby achieves low sharpness without significant computational or memory overhead over SGD or Adam. We evaluate MSAM in detail and reveal insights into separable mechanisms of NAG, SAM, and MSAM regarding training optimization and generalization. Code is available at https://github.com/MarlonBecker/MSAM.
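The core idea above — reusing the accumulated momentum vector as the sharpness perturbation so that only one gradient evaluation per step is needed — can be illustrated with a minimal sketch. This is not the authors' reference implementation; the toy quadratic loss, the hyperparameter values, and the sign convention (momentum `v` accumulates gradients, the update is `w <- w - lr * v`, so `+v` is the ascent direction) are illustrative assumptions.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w (illustrative, not from the paper).
A = np.diag([1.0, 10.0])

def grad(w):
    return A @ w

def msam_step(w, v, lr=0.1, beta=0.9, rho=0.05):
    """One MSAM-like step (sketch): perturb the parameters along the
    normalized accumulated momentum (the approximate ascent direction under
    the convention w <- w - lr * v), evaluate the gradient at the perturbed
    point, then apply an ordinary momentum update from the unperturbed w.
    Only a single gradient evaluation is needed, unlike SAM's two."""
    norm = np.linalg.norm(v)
    w_pert = w + rho * v / norm if norm > 0 else w  # look toward higher loss
    g = grad(w_pert)        # single gradient evaluation per step
    v = beta * v + g        # accumulate momentum
    w = w - lr * v          # descend from the unperturbed parameters
    return w, v

w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    w, v = msam_step(w, v)
```

In contrast, a SAM step would compute a first gradient at `w` just to form the perturbation, then a second gradient at the perturbed point; the sketch replaces the first evaluation with the momentum vector that SGD maintains anyway.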