This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state that we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the average rotation (a proxy for the effective learning rate) across different layers and neurons. Our work analyzes these dynamics across optimizers such as Adam, Lion, and SGD with momentum, offering a simple new perspective on training that helps explain the efficacy of widely used but poorly understood methods in deep learning. We demonstrate that balanced rotation plays a key role in the effectiveness of normalization methods such as Weight Standardization, and in the advantage of AdamW over Adam with L2 regularization. Finally, we show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
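The convergence to a steady state can be illustrated with a toy simulation. The following is only a minimal sketch under strong assumptions that are not taken from the study itself: plain SGD without momentum, a single idealized scale-invariant neuron whose gradient is always orthogonal to its weight vector and has norm 1/||w||, and purely random gradient directions. The function name `simulate` and all hyperparameter values are hypothetical. In this toy model the update rule implies a steady-state angular update of roughly sqrt(2 * lr * wd), independent of the initial weight norm.

```python
import numpy as np


def simulate(init_norm, lr=0.1, wd=0.01, steps=5000, dim=64, seed=0):
    """Toy model (an assumption, not the authors' setup): SGD with weight
    decay on one scale-invariant neuron driven by random gradients.

    Returns the final weight norm and the angle (radians) of the last update.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w *= init_norm / np.linalg.norm(w)  # start at the requested norm
    angle = 0.0
    for _ in range(steps):
        g = rng.standard_normal(dim)
        # Scale invariance: the gradient has no component along w ...
        g -= (g @ w) / (w @ w) * w
        # ... and its norm is inversely proportional to ||w|| (here 1/||w||).
        g /= np.linalg.norm(g) * np.linalg.norm(w)
        # SGD step with weight decay: w <- (1 - lr*wd) * w - lr * g
        w_new = (1.0 - lr * wd) * w - lr * g
        cos = (w @ w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
        angle = float(np.arccos(np.clip(cos, -1.0, 1.0)))
        w = w_new
    return float(np.linalg.norm(w)), angle


if __name__ == "__main__":
    for init_norm in (0.1, 10.0):
        norm, angle = simulate(init_norm)
        print(f"init_norm={init_norm}: final norm={norm:.4f}, angle={angle:.5f}")
    print("toy-model reference sqrt(2*lr*wd):", (2 * 0.1 * 0.01) ** 0.5)
```

Starting from a very small or a very large norm, both runs should end up at nearly the same weight norm and per-step rotation, which is the balancing effect described above. Momentum, adaptive optimizers such as Adam or Lion, and real gradients are not modeled here.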