By driving models to converge to flat minima, sharpness-aware learning algorithms (such as SAM) have achieved state-of-the-art performance. However, these algorithms generally incur one extra forward-backward propagation at each training iteration, which substantially increases the computational burden, especially for large-scale models. To this end, we propose a simple yet efficient training scheme called Randomized Sharpness-Aware Training (RST). At each iteration, the optimizer in RST performs a Bernoulli trial to choose randomly between a base algorithm (SGD) and a sharpness-aware algorithm (SAM), with a probability given by a predefined scheduling function. Because base-algorithm steps are mixed in, the overall number of propagation pairs can be greatly reduced. We also give a theoretical analysis of the convergence of RST. We then empirically study the computation cost and effect of various types of scheduling functions, and give guidance on choosing appropriate ones. Further, we extend RST to a general framework (G-RST), in which the degree of sharpness regularization can be adjusted freely for any scheduling function. We show that G-RST can outperform SAM in most cases while saving 50% of the extra computation cost.
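To make the per-iteration Bernoulli trial concrete, the following is a minimal sketch of the selection rule on a toy quadratic loss, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `train_rst`, `sgd_step`, `sam_step`, and `linear_schedule`, the toy loss, and the linearly increasing schedule are all hypothetical choices made for this example. Each call to `grad` stands in for one forward-backward propagation pair, so the returned counter shows how mixing in SGD steps reduces the total.

```python
import math
import random

# Toy quadratic loss f(w) = 0.5 * sum(c_i * w_i^2); a hypothetical stand-in for a network.
CURVATURE = [1.0, 10.0]

def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, w))

def grad(w):
    # Stands in for one forward-backward propagation pair.
    return [c * x for c, x in zip(CURVATURE, w)]

def sgd_step(w, lr):
    g = grad(w)  # one propagation pair
    return [x - lr * gi for x, gi in zip(w, g)], 1

def sam_step(w, lr, rho):
    g = grad(w)  # first propagation pair
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Perturb the weights along the normalized gradient, within radius rho.
    w_adv = [x + rho * gi / norm for x, gi in zip(w, g)]
    g_adv = grad(w_adv)  # second (extra) propagation pair
    return [x - lr * gi for x, gi in zip(w, g_adv)], 2

def linear_schedule(t, total):
    # Assumed example schedule: probability of a SAM step rises linearly from 0 to 1.
    return t / max(total - 1, 1)

def train_rst(w, steps, lr=0.05, rho=0.05, schedule=linear_schedule, seed=0):
    rng = random.Random(seed)
    pairs = 0
    for t in range(steps):
        p = schedule(t, steps)
        # Bernoulli trial: SAM step with probability p, SGD step otherwise.
        if rng.random() < p:
            w, used = sam_step(w, lr, rho)
        else:
            w, used = sgd_step(w, lr)
        pairs += used
    return w, pairs

if __name__ == "__main__":
    w, pairs = train_rst([1.0, 1.0], steps=200)
    print("final loss:", loss(w), "propagation pairs:", pairs)
```

With a schedule that always returns 0 the loop reduces to plain SGD (one pair per iteration), and with one that always returns 1 it reduces to SAM (two pairs per iteration); any schedule in between lands between those two costs.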