Weight decay is a widely used technique for training state-of-the-art deep networks, including large language models. Despite its widespread use, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning differs from the regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for underparameterized large language models trained with nearly online SGD, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss. Moreover, we show that weight decay also prevents sudden loss divergences in bfloat16 mixed-precision training, which is a crucial tool for LLM training. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. Our code is available at https://github.com/tml-epfl/why-weight-decay.
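For readers unfamiliar with the mechanics, the following is a minimal sketch of a single SGD step with weight decay, written in the textbook form where the decay term is the gradient of an L2 penalty (wd / 2) * ||w||^2. It is an illustration under that assumption, not the code from the repository above; the function name `sgd_step_with_weight_decay` and the parameters `lr` and `wd` are hypothetical.

```python
def sgd_step_with_weight_decay(w, grad, lr, wd):
    """Return the weights after one SGD step with L2-style weight decay.

    Assumed update rule (textbook form, not taken from the paper's code):
        w <- w - lr * (grad + wd * w)
    which is equivalent to shrinking w by (1 - lr * wd) and then
    taking a plain gradient step of size lr.
    """
    return [wi - lr * (gi + wd * wi) for wi, gi in zip(w, grad)]


if __name__ == "__main__":
    # With a zero gradient, the step only shrinks the weights toward zero.
    weights = [1.0, -2.0]
    zero_grad = [0.0, 0.0]
    print(sgd_step_with_weight_decay(weights, zero_grad, lr=0.1, wd=0.5))
```

With a zero gradient the update reduces to multiplying each weight by (1 - lr * wd), which makes it easy to see that the decay term acts on the trajectory of the weights at every step rather than only on the final solution.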