Why does training neural networks with a large learning rate for a longer time often lead to better generalization? In this paper, we delve into this question by examining the relationship between the training and testing losses of neural networks. By visualizing these losses, we observe that the training trajectory under a large learning rate navigates along the minima manifold of the training loss and eventually approaches a neighborhood of the testing loss minimum. Motivated by these findings, we introduce a nonlinear model whose loss landscapes mirror those observed in real neural networks. Analyzing SGD training on this model, we demonstrate that an extended phase with a large learning rate steers the model toward the minimum-norm solution of the training loss, which may achieve near-optimal generalization, thereby affirming the empirically observed benefits of late learning rate decay.
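For concreteness, below is a minimal sketch of the training protocol the abstract refers to: plain SGD run with a large learning rate for most of training, with the decay applied only late. The data, model architecture, learning rate, and decay schedule are all illustrative assumptions (written in PyTorch), not the paper's actual experimental setup.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data with a held-out test set (illustrative assumption).
d = 10
w_true = torch.randn(d, 1)
X_train, X_test = torch.randn(256, d), torch.randn(256, d)
y_train = X_train @ w_true / d**0.5 + 0.05 * torch.randn(256, 1)
y_test = X_test @ w_true / d**0.5 + 0.05 * torch.randn(256, 1)

# Small over-parameterized MLP; the width is an arbitrary choice.
model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# Large learning rate held for most of training and decayed only near the end
# ("late learning rate decay"); the milestone and decay factor are illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[1600], gamma=0.01
)

for step in range(2000):
    # Plain mini-batch SGD on the training loss.
    idx = torch.randperm(256)[:32]
    optimizer.zero_grad()
    loss = loss_fn(model(X_train[idx]), y_train[idx])
    loss.backward()
    optimizer.step()
    scheduler.step()

    if (step + 1) % 400 == 0:
        # Track training and testing losses, echoing the visualization
        # described in the abstract.
        with torch.no_grad():
            train_loss = loss_fn(model(X_train), y_train).item()
            test_loss = loss_fn(model(X_test), y_test).item()
        print(f"step {step + 1:4d}  lr {scheduler.get_last_lr()[0]:.4f}  "
              f"train {train_loss:.4f}  test {test_loss:.4f}")
```

In this sketch, the late decay is what allows the iterates to settle into a minimum after the long large-learning-rate phase; moving the milestone earlier shortens that phase, which is the comparison the abstract's argument concerns.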