This work presents constrained parameter regularization (CPR), an alternative to traditional weight decay. Instead of applying a constant penalty uniformly to all parameters, we enforce an upper bound on a statistical measure (e.g., the L$_2$-norm) of individual parameter groups. This reformulates learning as a constrained optimization problem, which we solve with an adaptation of the augmented Lagrangian method. Our approach allows the regularization strength to vary across parameter groups and removes the need for explicit penalty coefficients in the regularization terms. CPR requires only two hyperparameters and introduces no measurable runtime overhead. We offer empirical evidence of CPR's effectiveness through experiments on the "grokking" phenomenon, image classification, and language modeling. Our findings show that CPR can counteract the effects of grokking and that it consistently matches or surpasses the performance of traditional weight decay.
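To make the constrained formulation concrete, the following is a minimal sketch of how an upper bound on one parameter group could be enforced with a Lagrange multiplier. It is an illustration under stated assumptions, not the authors' implementation: the names `cpr_style_step`, `run`, `kappa` (the bound), `mu` (the multiplier update rate), and `lr` are hypothetical, the squared L$_2$-norm is chosen as the measure, and the multiplier rule `lam = max(0, lam + mu * (R(theta) - kappa))` is a generic augmented-Lagrangian-style update for inequality constraints that may differ from the adaptation used in the paper.

```python
def l2_sq(theta):
    """Squared L2 norm of one parameter group (the assumed measure R)."""
    return sum(x * x for x in theta)


def cpr_style_step(theta, grad_loss, lam, kappa, mu, lr):
    """One hypothetical update for a single parameter group.

    The multiplier `lam` replaces a fixed weight-decay coefficient: it grows
    while R(theta) > kappa, shrinks while R(theta) < kappa, and is clipped
    at zero so that a satisfied constraint adds no penalty.
    """
    # Multiplier update (assumed form, see the lead-in).
    lam = max(0.0, lam + mu * (l2_sq(theta) - kappa))
    # Gradient step on loss + lam * R(theta); the gradient of R is 2 * theta.
    theta = [x - lr * (g + lam * 2.0 * x) for x, g in zip(theta, grad_loss)]
    return theta, lam


def run(kappa, steps=2000, mu=0.1, lr=0.1):
    """Minimize the toy loss 0.5 * ||theta - target||^2 under R(theta) <= kappa."""
    target = [3.0, 4.0]          # unconstrained optimum, squared norm 25
    theta, lam = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad = [x - t for x, t in zip(theta, target)]
        theta, lam = cpr_style_step(theta, grad, lam, kappa, mu, lr)
    return theta, lam


# Active constraint: the squared norm should settle near kappa.
theta_c, lam_c = run(kappa=4.0)
# Inactive constraint: the multiplier should stay at zero.
theta_u, lam_u = run(kappa=100.0)
print(l2_sq(theta_c), lam_c)
print(theta_u, lam_u)
```

In this toy setting the multiplier adapts on its own: with `kappa=4.0` it rises until the group's squared norm sits at the bound, and with `kappa=100.0` it stays at zero and the update reduces to plain gradient descent. Keeping one multiplier per parameter group is what would let the regularization strength differ across groups.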