AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks

The sharpness-aware minimization (SAM) optimizer has been extensively studied because it improves generalization when training deep neural networks: it introduces an extra perturbation step that flattens the loss landscape. Combining SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without theoretical guarantees, owing to the threefold difficulty of analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We show theoretically that AdaSAM admits an $\mathcal{O}(1/\sqrt{bT})$ convergence rate, which yields a linear speedup with respect to the mini-batch size $b$. Specifically, to decouple the stochastic gradient step from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term so that these quantities become independent when taking expectations in the analysis. We then bound them by showing that the adaptive learning rate lies in a limited range, which makes the analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
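The abstract does not spell out the update rule, so the following is only a minimal sketch of how a SAM perturbation step might be combined with momentum and an adaptive learning rate. It assumes an AMSGrad-style second-moment estimate, omits bias correction, and uses a toy quadratic objective. The function name `adasam_step`, the matrix `A`, and all hyperparameter values are hypothetical and not taken from the paper. The delayed second-order momentum term is a device of the analysis and is not reproduced here.

```python
import numpy as np

# Toy objective (an assumption for illustration): f(w) = 0.5 * w^T A w.
A = np.diag([1.0, 10.0])


def loss(w):
    return 0.5 * w @ A @ w


def grad(w):
    return A @ w


def adasam_step(w, m, v, v_hat, lr=0.01, rho=0.05,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical AdaSAM-style update (a sketch, not the paper's exact algorithm)."""
    # SAM perturbation: move a distance rho along the normalized gradient.
    g = grad(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + eps)
    # Gradient evaluated at the perturbed point.
    g_sam = grad(w_adv)
    # Momentum (first moment) of the perturbed gradient.
    m = beta1 * m + (1.0 - beta1) * g_sam
    # Second moment; the AMSGrad-style running maximum keeps the
    # per-coordinate step size from growing again later.
    v = beta2 * v + (1.0 - beta2) * g_sam ** 2
    v_hat = np.maximum(v_hat, v)
    # Adaptive step applied to the original (unperturbed) weights.
    w = w - lr * m / (np.sqrt(v_hat) + eps)
    return w, m, v, v_hat


w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
v_hat = np.zeros_like(w)

initial_loss = loss(w)
for _ in range(200):
    w, m, v, v_hat = adasam_step(w, m, v, v_hat)
final_loss = loss(w)

print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

Each step costs two gradient evaluations, one at the current weights and one at the perturbed weights, which is the extra perturbation step the abstract refers to.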

