Recently, flat-minima optimizers, which seek parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: (1) Stochastic Weight Averaging (SWA) and (2) Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap by comparing the loss surfaces of models trained with each method and by benchmarking broadly across computer vision, natural language processing, and graph representation learning tasks. These results yield several surprising findings, which we hope will help researchers further improve deep learning optimizers and help practitioners identify the right optimizer for their problem.
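To make the two methods named above concrete, the following is a minimal sketch of their commonly described update rules, not the implementation benchmarked in this work. It assumes a toy quadratic loss in place of a neural network, and the names `loss`, `grad`, `sgd_step`, `sam_step`, `swa_average`, and `train`, along with the values of `lr`, `rho`, `steps`, and `swa_start`, are hypothetical choices made for illustration.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss standing in for a network's training loss (assumption).
    return 0.5 * float(np.sum(w ** 2))

def grad(w):
    # Gradient of the toy quadratic loss above.
    return w

def sgd_step(w, lr=0.1):
    # Plain gradient descent step, used here as the base optimizer.
    return w - lr * grad(w)

def sam_step(w, lr=0.1, rho=0.05):
    # SAM: move to an approximate worst-case point inside an L2 ball of
    # radius rho, then descend using the gradient evaluated at that point.
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * grad(w + eps)

def swa_average(checkpoints):
    # SWA: uniform average of weights collected along the training trajectory.
    return np.mean(np.stack(checkpoints), axis=0)

def train(w0, step_fn, steps=50, swa_start=25):
    # Run step_fn for `steps` iterations and collect weights from
    # iteration `swa_start` onward for averaging.
    w = np.asarray(w0, dtype=float)
    checkpoints = []
    for t in range(steps):
        w = step_fn(w)
        if t >= swa_start:
            checkpoints.append(w.copy())
    return w, swa_average(checkpoints)

if __name__ == "__main__":
    w0 = [1.0, -2.0]
    w_sam, _ = train(w0, sam_step)       # SAM alone: use the final iterate
    _, w_swa = train(w0, sgd_step)       # SWA alone: use the averaged iterate
    print("SAM final loss:", loss(w_sam))
    print("SWA averaged loss:", loss(w_swa))
```

In this sketch SAM changes the gradient used at every step, while SWA leaves the base optimizer untouched and only averages the weights it visits, which is one reason the two are often treated as complementary.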