It is common in deep learning to warm up the learning rate $\eta$, often by a linear schedule between $\eta_{\text{init}} = 0$ and a predetermined target $\eta_{\text{trgt}}$. In this paper, we show through systematic experiments with SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger $\eta_{\text{trgt}}$ by forcing it into better-conditioned regions of the loss landscape. The ability to handle larger $\eta_{\text{trgt}}$ makes hyperparameter tuning more robust while improving final performance. We uncover different regimes of operation during the warmup period, depending on whether training starts off in a progressive-sharpening or a sharpness-reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how $\eta_{\text{init}}$ can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps and in some cases completely eliminates the need for warmup. We also suggest an initialization for the variance in Adam that provides benefits similar to warmup.
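The linear warmup schedule described above can be written out concretely. The following is a minimal sketch, not code from the paper; the function name and arguments are illustrative:

```python
# Minimal sketch of linear learning-rate warmup from eta_init
# (commonly 0) to a target eta_trgt over `warmup_steps` steps,
# after which the rate is held at eta_trgt.
def linear_warmup_lr(step, eta_trgt, warmup_steps, eta_init=0.0):
    """Return the learning rate at `step` under a linear warmup schedule."""
    if step >= warmup_steps:
        return eta_trgt
    # Linear interpolation between eta_init and eta_trgt.
    return eta_init + (eta_trgt - eta_init) * step / warmup_steps
```

In practice this schedule is applied per optimizer step, e.g. by updating each parameter group's learning rate before calling the optimizer.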