Deep neural networks that combine batch normalization with ReLU-like activation functions can become unstable in the early stages of training because of the large gradients induced by temporal gradient explosion. We explain how ReLU reduces variance more than expected and how batch normalization amplifies the gradient when it restores that variance, so that gradients explode even though forward propagation remains stable. We also discuss how the dynamics of a deep neural network change during training and how correlation between inputs can alleviate the problem. Finally, we propose an adaptive learning rate algorithm inspired by second-order optimization methods; it outperforms existing learning rate scaling methods in large-batch training and can also replace WarmUp in small-batch training.
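The mechanism described above can be probed numerically. The following is a minimal sketch, not the experimental setup of this work: it assumes a plain fully connected stack of Linear, batch normalization (without affine parameters), and ReLU, with He-initialized weights and i.i.d. Gaussian inputs. The function names (`bn_forward`, `bn_backward`, `simulate`) and the default sizes are hypothetical choices for illustration. For a standard normal pre-activation, the variance after ReLU is 1/2 - 1/(2π) ≈ 0.34 rather than 1/2, and batch normalization in the next layer rescales by the inverse standard deviation in both the forward and backward pass, which is the amplification the sketch tries to expose.

```python
import numpy as np


def bn_forward(z, eps=1e-5):
    """Batch normalization over the batch axis, without affine parameters."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    z_hat = (z - mu) * inv_std
    return z_hat, inv_std


def bn_backward(dz_hat, z_hat, inv_std):
    """Gradient w.r.t. the pre-normalization input (standard BN backward)."""
    return inv_std * (
        dz_hat
        - dz_hat.mean(axis=0)
        - z_hat * (dz_hat * z_hat).mean(axis=0)
    )


def simulate(depth=30, width=256, batch=512, seed=0):
    """Return per-layer activation variances and gradient norms at init.

    Assumed setup (illustrative only): i.i.d. Gaussian inputs, He init,
    and a random upstream gradient at the last layer.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    caches, act_var = [], []

    # Forward pass: Linear -> BN -> ReLU, repeated `depth` times.
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        z_hat, inv_std = bn_forward(h @ W)
        h = np.maximum(z_hat, 0.0)
        caches.append((W, z_hat, inv_std))
        act_var.append(h.var())

    # Backward pass: start from a random gradient at the output.
    g = rng.standard_normal(h.shape)
    grad_norms = []
    for W, z_hat, inv_std in reversed(caches):
        g = g * (z_hat > 0)                 # ReLU backward
        g = bn_backward(g, z_hat, inv_std)  # BN backward (scales by 1/std)
        grad_norms.append(np.linalg.norm(g))
        g = g @ W.T                         # Linear backward
    grad_norms.reverse()                    # index 0 = layer nearest the input
    return act_var, grad_norms


if __name__ == "__main__":
    act_var, grad_norms = simulate()
    print("activation variance, first/last layer:", act_var[0], act_var[-1])
    print("gradient norm, first/last layer:", grad_norms[0], grad_norms[-1])
```

Under these assumptions the activation variance should stay roughly constant with depth, while the gradient norm should grow toward the input layers. Feeding strongly correlated inputs instead of i.i.d. noise is one way to explore the alleviating effect mentioned above.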