Exponential Moving Average (EMA) is a widely used weight averaging (WA) regularization for learning flat optima that generalize better, at no extra cost in deep neural network (DNN) optimization. Despite achieving better flatness, existing WA methods can end up with worse final performance or require extra test-time computation. This work unveils the full potential of EMA with a single-line modification, i.e., switching the EMA parameters to the original model after each epoch, dubbed Switch EMA (SEMA). From both theoretical and empirical aspects, we demonstrate that SEMA helps DNNs reach generalization optima that better trade off flatness and sharpness. To verify the effectiveness of SEMA, we conduct comparison experiments on discriminative, generative, and regression tasks with vision and language datasets, including image classification, self-supervised learning, object detection and segmentation, image generation, video prediction, attribute regression, and language modeling. Comprehensive results with popular optimizers and networks show that SEMA is a free lunch for DNN training, improving performance and boosting convergence speed.
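The single-line modification described above can be illustrated with a minimal sketch. It reads "switching the EMA parameters to the original model after each epoch" as copying the averaged weights into the trained weights at each epoch boundary, which is an interpretation of the abstract rather than the authors' implementation. The function names (`ema_update`, `train_sema`), the plain gradient-descent update, the decay and learning-rate values, and the toy quadratic loss are all assumptions made for illustration.

```python
from typing import Callable, List


def ema_update(ema: List[float], params: List[float], decay: float) -> None:
    """In-place EMA step: ema <- decay * ema + (1 - decay) * params."""
    for i, p in enumerate(params):
        ema[i] = decay * ema[i] + (1.0 - decay) * p


def train_sema(
    params: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    epochs: int,
    steps_per_epoch: int,
    lr: float = 0.1,       # hypothetical learning rate
    decay: float = 0.9,    # hypothetical EMA decay
    switch: bool = True,   # False gives plain EMA, True gives the switch
) -> List[float]:
    """Gradient descent with an EMA copy of the weights; returns the EMA weights."""
    ema = list(params)  # the EMA copy starts from the initial weights
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            grads = grad_fn(params)
            for i, g in enumerate(grads):
                params[i] -= lr * g         # ordinary optimizer step
            ema_update(ema, params, decay)  # ordinary EMA step
        if switch:
            # The one-line change (as interpreted here): at the end of each
            # epoch, overwrite the trained weights with the EMA weights.
            params[:] = ema
    return ema


def quad_grad(w: List[float]) -> List[float]:
    """Gradient of the toy loss (w - 3)^2, used only for this demo."""
    return [2.0 * (w[0] - 3.0)]


weights = [0.0]
ema_weights = train_sema(weights, quad_grad, epochs=5, steps_per_epoch=20)
```

With `switch=False` the loop reduces to ordinary EMA, where the averaged weights are tracked on the side and never fed back into training; with `switch=True` the next epoch starts from the averaged weights, so no separate EMA model needs to be evaluated at test time.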