Regularization is a critical component of deep learning training, and weight decay is one of the most commonly used approaches. Weight decay applies a constant penalty coefficient uniformly across all parameters, which may restrict some parameters more than necessary while restricting others too little. To adjust penalty coefficients dynamically for different parameter groups, we present constrained parameter regularization (CPR) as an alternative to traditional weight decay. Instead of applying a single constant penalty to all parameters, we enforce an upper bound on a statistical measure (e.g., the $L_2$-norm) of each parameter group. Learning thereby becomes a constrained optimization problem, which we address with an adaptation of the augmented Lagrangian method. CPR requires only two hyperparameters and incurs no measurable runtime overhead. Additionally, we propose a simple but efficient mechanism to adapt the upper bounds during optimization. We provide empirical evidence of CPR's efficacy in experiments on the "grokking" phenomenon, computer vision, and language modeling tasks. Our results demonstrate that CPR counteracts the effects of grokking and consistently matches or outperforms traditional weight decay.
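To make the idea of bounding a parameter group's norm with a Lagrange multiplier more concrete, the following is a minimal sketch of a generic augmented-Lagrangian-style update for a single parameter group. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of $R(\theta) = \tfrac{1}{2}\lVert\theta\rVert_2^2$ as the constrained measure, the toy quadratic loss, and all numeric values (`kappa`, `mu`, `lr`) are hypothetical. The multiplier `lam` grows while the measure exceeds the bound `kappa` and shrinks toward zero otherwise, so it plays the role of a penalty coefficient that adapts per group.

```python
import math


def constrained_step(theta, grad_loss, lam, kappa, lr, mu):
    """One update for a single parameter group (hypothetical sketch).

    theta     : list of floats, the parameters of one group
    grad_loss : list of floats, gradient of the training loss w.r.t. theta
    lam       : current Lagrange multiplier for this group (>= 0)
    kappa     : assumed upper bound on R(theta) = 0.5 * ||theta||^2
    lr, mu    : learning rate and multiplier update rate (assumed values)
    """
    # Ordinary gradient step on the loss alone.
    theta = [t - lr * g for t, g in zip(theta, grad_loss)]

    # Constrained measure of the group after the loss step.
    r = 0.5 * sum(t * t for t in theta)

    # Multiplier update, clipped at zero: it grows only while the
    # bound is violated and decays once the constraint is satisfied.
    lam = max(0.0, lam + mu * (r - kappa))

    # The gradient of R is theta itself, so the correction shrinks
    # the group by the factor (1 - lr * lam).
    theta = [t - lr * lam * t for t in theta]
    return theta, lam


def run_demo(steps=500):
    """Toy problem: minimise 0.5 * ||theta - target||^2 under the bound."""
    target = [3.0, 4.0]   # unconstrained optimum has norm 5
    kappa = 2.0           # bound on 0.5 * ||theta||^2, i.e. norm <= 2
    theta, lam = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad = [t - c for t, c in zip(theta, target)]
        theta, lam = constrained_step(theta, grad, lam, kappa, lr=0.1, mu=1.0)
    return theta, lam


if __name__ == "__main__":
    theta, lam = run_demo()
    print("norm:", math.sqrt(sum(t * t for t in theta)), "lambda:", lam)
```

In this toy setting the unconstrained minimiser lies well outside the bound, so the multiplier settles at a positive value and the parameters stay inside the allowed region. With one multiplier per parameter group, groups that never reach their bound would keep `lam` at zero and receive no penalty at all, which is the contrast with a single global weight decay coefficient that the abstract describes.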