Recent studies have shown that large disparities in effective learning rates (ELRs) across layers in deep neural networks can negatively affect trainability. We formalize how these disparities evolve over time by modeling the weight dynamics (the evolution of expected gradient norms and weight norms) of networks with normalization layers, predicting the evolution of layer-wise ELR ratios. We prove that, when training with any constant learning rate, ELR ratios converge to 1, despite initial gradient explosion. We identify a ``critical learning rate'' beyond which ELR disparities widen, which depends only on the current ELRs. To validate our findings, we devise a hyperparameter-free warm-up method that rapidly minimizes ELR spread, both in theory and in practice. Our experiments link ELR spread with trainability, a relationship that is most evident in very deep networks with significant gradient-magnitude excursions.
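The convergence claim can be illustrated with a toy simulation of two scale-invariant layers trained at a constant learning rate. The modeling assumptions below (a gradient-norm law g_i = c_i / ||w_i|| for scale-invariant layers, the squared weight norm growing by lr² · g_i² per step, and the ELR defined as lr · ||g_i|| / ||w_i||) are common simplifications chosen here for illustration, not the paper's exact formulation:

```python
def simulate_elr_ratio(lr, c1, c2, n1, n2, steps):
    """Toy model of two scale-invariant layers under a constant learning rate.

    Assumptions (illustrative simplifications, not the paper's exact model):
    - gradient norm of layer i at step t: g_i = c_i / ||w_i||
    - squared weight norm grows by lr^2 * g_i^2 per step (gradient orthogonal
      to the weights, as holds for scale-invariant layers)
    - effective learning rate of layer i: lr * g_i / ||w_i|| = lr * c_i / ||w_i||^2

    n1, n2 track the squared weight norms ||w_i||^2 of the two layers.
    Returns the ratio ELR_1 / ELR_2 after `steps` updates.
    """
    for _ in range(steps):
        n1 += (lr * c1) ** 2 / n1
        n2 += (lr * c2) ** 2 / n2
    elr1 = lr * c1 / n1
    elr2 = lr * c2 / n2
    return elr1 / elr2

# Layers start with very different ELRs (ratio 0.5), but the ratio drifts
# toward 1 as the weight norms equilibrate under the constant learning rate.
print(simulate_elr_ratio(lr=0.1, c1=1.0, c2=4.0, n1=1.0, n2=2.0, steps=0))
print(simulate_elr_ratio(lr=0.1, c1=1.0, c2=4.0, n1=1.0, n2=2.0, steps=100_000))
```

In this model the squared norm obeys n_i² ≈ n_i(0)² + 2(lr · c_i)² · t, so for large t each layer's ELR approaches 1/√(2t) regardless of c_i, and the ratio tends to 1, consistent with the convergence result stated above.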