We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
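For concreteness, the masked decay can be sketched in a few lines of PyTorch-style code. The snippet below is a minimal illustration, not the paper's reference implementation: the function name `cwd_step`, the convention that `update` is the raw optimizer step that gets *subtracted* from the parameter, and the hard 0/1 mask are assumptions made for the example.

```python
import torch

def cwd_step(param: torch.Tensor, update: torch.Tensor,
             lr: float, weight_decay: float) -> None:
    """One Cautious Weight Decay step for a single parameter tensor (sketch).

    `update` is the base optimizer's step direction (e.g. from AdamW, Lion,
    or Muon) that will be subtracted from `param`; this sign convention is
    an assumption of the sketch, not prescribed by the abstract.
    """
    # Decay only the coordinates whose sign agrees with the optimizer update,
    # i.e. where the decay term and the update push in the same direction.
    mask = (torch.sign(param) == torch.sign(update)).to(param.dtype)
    # Decoupled weight decay, gated by the mask, applied alongside the step.
    param.add_(update + weight_decay * mask * param, alpha=-lr)
```

Compared with plain decoupled decay, the only change is the extra `mask` factor on the `weight_decay * param` term, which is what makes CWD a one-line, optimizer-agnostic modification.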