Weight decay is a simple yet powerful regularization technique that is widely used in training deep neural networks (DNNs). Although weight decay has attracted much attention, previous studies have overlooked pitfalls arising from the large gradient norms it can cause. In this paper, we show that weight decay can lead to large gradient norms in the final phase of training (or at the terminated solution), which often indicates poor convergence and poor generalization. To mitigate these gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, the Scheduled Weight Decay (SWD) method, which dynamically adjusts the weight decay strength according to the gradient norm and significantly penalizes large gradient norms during training. Our experiments support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
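The abstract does not state the SWD update rule, so the following is only a minimal sketch of what a gradient-norm-dependent weight decay schedule could look like. Everything in it is an assumption made for illustration: the class name `GradNormDecayScheduler`, the parameters `beta` and `max_scale`, the choice to scale the decay coefficient by the ratio of a running average of the gradient norm to the current gradient norm, and the use of a plain gradient step with decoupled weight decay in place of Adam.

```python
import math


def grad_norm(grads):
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


class GradNormDecayScheduler:
    """Hypothetical scheduler: NOT the paper's exact rule.

    Assumed rule: decay_t = base_decay * avg_norm_t / norm_t, so the decay
    shrinks when the current gradient norm exceeds its running average.
    """

    def __init__(self, base_decay, beta=0.9, max_scale=10.0, eps=1e-12):
        self.base_decay = base_decay
        self.beta = beta            # smoothing factor for the running average
        self.max_scale = max_scale  # cap on the scaling when the norm is tiny
        self.eps = eps
        self.avg_norm = None

    def step(self, grads):
        """Update the running average and return the decay for this step."""
        norm = grad_norm(grads)
        if self.avg_norm is None:
            self.avg_norm = norm
        else:
            self.avg_norm = self.beta * self.avg_norm + (1 - self.beta) * norm
        scale = min(self.avg_norm / (norm + self.eps), self.max_scale)
        return self.base_decay * scale


def decayed_step(weights, grads, lr, decay):
    """One plain gradient step with decoupled weight decay."""
    return [w - lr * g - lr * decay * w for w, g in zip(weights, grads)]


if __name__ == "__main__":
    scheduler = GradNormDecayScheduler(base_decay=0.1)
    weights = [1.0, -2.0]
    for grads in ([3.0, 4.0], [6.0, 8.0]):
        decay = scheduler.step(grads)
        weights = decayed_step(weights, grads, lr=0.01, decay=decay)
        print(decay, weights)
```

Under this assumed rule, a step whose gradient norm is twice the running average receives a smaller decay coefficient than the baseline. The actual SWD schedule should be taken from the paper's algorithm description, not from this sketch.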