This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks, combining applied analysis with experimentation. Weight decay can cause the expected magnitude of a neuron's weight vector and its expected angular update to converge to a steady state that we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the average rotation (a proxy for the effective learning rate) across layers and neurons. We analyze these dynamics for optimizers such as Adam, Lion, and SGD with momentum, offering a simple new perspective on training that helps explain the efficacy of widely used but poorly understood methods in deep learning. We demonstrate that balanced rotation plays a key role in the effectiveness of normalization methods such as Weight Standardization, and in the advantage of AdamW over Adam with L2 regularization. Finally, we show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
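To make the notion of rotational equilibrium concrete, the following is a minimal sketch of a toy model, not the setup used in the study. It assumes a single scale-invariant neuron (as when the weights feed into a normalization layer), so that the gradient is orthogonal to the weight vector and its norm scales as 1/||w||; the gradient direction is random, and the optimizer is plain SGD with weight decay and no momentum. The names `angle_between` and `simulate_rotation` and all hyperparameter values are illustrative choices. Under these assumptions, a short calculation suggests that the per-step angular update settles near sqrt(2 * lr * wd) regardless of the initial weight norm.

```python
import numpy as np


def angle_between(a, b):
    """Angle in radians between vectors a and b."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def simulate_rotation(lr=0.1, wd=0.01, dim=256, steps=5000,
                      init_norm=1.0, seed=0):
    """Toy SGD-with-weight-decay run for one scale-invariant neuron.

    Assumptions (illustrative, not taken from the source): the gradient
    has a random direction, is orthogonal to w, and has norm 1 / ||w||.
    Returns the per-step angular updates and the final weight norm.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w *= init_norm / np.linalg.norm(w)

    angles = np.empty(steps)
    for t in range(steps):
        g = rng.standard_normal(dim)
        g -= (np.dot(g, w) / np.dot(w, w)) * w   # remove the component along w
        g /= np.linalg.norm(g)                   # unit-norm direction
        g /= np.linalg.norm(w)                   # scale invariance: ||g|| = 1 / ||w||

        w_new = w - lr * (g + wd * w)            # SGD step with weight decay
        angles[t] = angle_between(w, w_new)      # rotation caused by this step
        w = w_new

    return angles, float(np.linalg.norm(w))


if __name__ == "__main__":
    lr, wd = 0.1, 0.01
    for init_norm in (0.1, 10.0):
        angles, final_norm = simulate_rotation(lr=lr, wd=wd, init_norm=init_norm)
        print(f"init_norm={init_norm}: first angle={angles[0]:.4f}, "
              f"last angle={angles[-1]:.4f}, final norm={final_norm:.3f}")
    print(f"toy-model prediction sqrt(2*lr*wd) = {np.sqrt(2 * lr * wd):.4f}")
```

Starting from a small norm, the neuron initially rotates much faster than the equilibrium rate; starting from a large norm, it rotates much more slowly. In both runs the angular update should approach roughly the same value (about 0.045 for these defaults), which illustrates in this simplified setting how weight decay can balance the effective learning rate across neurons that start at different scales.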