Recently, flat-minima optimizers, which seek parameters in low-loss neighborhoods, have been shown to improve upon stochastic and adaptive gradient-based optimizers for training neural networks. Two methods have received significant attention due to their impressive generalization performance and scalability: (1) Stochastic Weight Averaging (SWA) and (2) Sharpness-Aware Minimization (SAM). However, their properties have received limited investigation, and the two have never been systematically benchmarked against each other; previous work mainly evaluated SWA and SAM on different architectures and datasets. We fill this gap here by comparing the loss surfaces of models trained with each method and through a broad benchmark across computer vision, natural language processing, and graph representation learning tasks. These results yield a number of surprising findings, which we hope will help researchers further improve deep learning optimizers and help practitioners identify the right optimizer for their problem.
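For readers unfamiliar with the two methods, the following is a minimal PyTorch sketch of one SAM update step combined with SWA weight averaging. It is our own illustration, not the benchmarked implementations: the toy model, data, `rho` (SAM's neighborhood radius), and all other hyperparameters are illustrative placeholders.

```python
# Minimal sketch: one SAM step (ascend to a nearby high-loss point, then
# apply the base optimizer using the gradient there) plus SWA averaging.
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One Sharpness-Aware Minimization update with a given base optimizer."""
    loss_fn(model(x), y).backward()
    # Global L2 norm of the gradient, used to scale the ascent step.
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(2) for p in model.parameters() if p.grad is not None]))
    eps = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)        # w <- w + eps: ascent within an L2 ball
            eps[p] = e
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()  # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)        # restore the original weights
    base_opt.step()          # descent step using the perturbed gradient
    base_opt.zero_grad()

# Toy usage: a linear model on random data.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)     # SWA: running average of the weights
x, y = torch.randn(32, 10), torch.randn(32, 1)
for step in range(100):
    sam_step(model, nn.MSELoss(), x, y, opt)
    if step >= 50:                   # start averaging late in training
        swa_model.update_parameters(model)
```

Note the contrast in cost: SAM requires two forward-backward passes per update, whereas SWA only adds a cheap running average of the weights on top of ordinary training.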