Excursions in gradient magnitude pose a persistent challenge when training deep networks. In this paper, we study the early training phases of deep normalized ReLU networks, accounting for the induced scale invariance by examining effective learning rates (LRs). Starting from the well-known fact that batch normalization (BN) leads to exponentially exploding gradients at initialization, we develop an ODE-based model to describe early training dynamics. Our model predicts that, under gradient flow, effective LRs will eventually equalize, consistent with empirical findings on warm-up training. Using large LRs is analogous to applying an explicit solver to a stiff non-linear ODE: it causes overshooting and vanishing gradients in lower layers after the first step. Achieving overall balance demands careful tuning of LRs, depth, and (optionally) momentum. Our model predicts the formation of spreads in effective LRs across layers, consistent with empirical measurements. Moreover, we observe that large spreads in effective LRs lead to accuracy problems in training, indicating the importance of controlling these dynamics. To further support a causal relationship, we implement a simple scheduling scheme that prescribes uniform effective LRs across layers and confirm accuracy benefits.
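The analogy between large LRs and an explicit solver on a stiff ODE can be illustrated with a minimal sketch. It is a linear toy problem, not the non-linear model developed in the paper: gradient descent on a quadratic loss coincides with explicit Euler applied to the gradient-flow ODE dy/dt = -lam * y. The function name `explicit_euler`, the curvatures in `LAMS`, and the step sizes are all assumptions chosen for illustration.

```python
# Toy illustration (not the paper's model): gradient descent on a quadratic
# loss is explicit Euler applied to the gradient-flow ODE dy/dt = -lam * y.

def explicit_euler(lams, y0, step, n_steps):
    """Integrate dy_i/dt = -lam_i * y_i with explicit Euler.

    Each update is y_i <- y_i - step * lam_i * y_i, which is exactly one
    gradient-descent step on L(y) = 0.5 * sum(lam_i * y_i**2) with LR = step.
    """
    y = list(y0)
    for _ in range(n_steps):
        y = [yi - step * lam * yi for yi, lam in zip(y, lams)]
    return y

# Hypothetical curvatures: a slow direction and a 100x stiffer one.
LAMS = [1.0, 100.0]
Y0 = [1.0, 1.0]

# Explicit Euler is stable for a component only if |1 - step * lam| < 1,
# i.e. step < 2 / lam. The stiff direction therefore caps the usable step.
small = explicit_euler(LAMS, Y0, step=0.015, n_steps=20)
large = explicit_euler(LAMS, Y0, step=0.03, n_steps=20)

print("step=0.015:", small)  # both components shrink toward zero
print("step=0.030:", large)  # stiff component overshoots and grows
```

With the smaller step both components decay, while doubling the step makes the stiff component flip sign and double in magnitude at every iteration. This toy only shows the overshooting part of the analogy; the vanishing gradients in lower layers described above depend on the scale invariance of normalized networks, which the sketch does not model.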