We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
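The coordinate-wise masking rule described above can be sketched in a few lines. This is a minimal illustration based only on the abstract's description; the function name `cwd_step` and the exact placement of the decay term inside a given optimizer's update are assumptions, not the paper's reference implementation:

```python
import numpy as np

def cwd_step(params, update, lr=1e-3, wd=0.1):
    """One Cautious Weight Decay step (illustrative sketch).

    Decay is applied only to coordinates where the parameter's sign
    matches the optimizer update's sign; all other coordinates receive
    the plain (undecayed) update, leaving the original loss untouched.
    """
    # Mask: 1 where sign(param) aligns with sign(update), else 0.
    mask = (np.sign(update) == np.sign(params)).astype(params.dtype)
    # Decoupled decay, gated by the mask (placement is an assumption).
    return params - lr * (update + wd * mask * params)
```

For example, with `params = [1.0, -1.0]` and `update = [1.0, 1.0]`, only the first coordinate is decayed, since its sign agrees with the update's; the second coordinate takes the raw update.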