It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting, focusing on two questions: 1) how large an initial LR is required to obtain optimal quality, and 2) what are the key differences between models trained with different LRs? We find that only a narrow range of initial LRs, slightly above the convergence threshold, leads to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of the reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that contains only high-quality minima. Additionally, we show that these initial LRs yield a sparse set of learned features, with a clear focus on those most relevant to the task. In contrast, starting training with too small an LR leads to unstable minima and an attempt to learn all features simultaneously, resulting in poor generalization. Conversely, initial LRs that are too large fail to locate a basin with good solutions or to extract meaningful patterns from the data.
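As a rough illustration of the training protocol referred to above (large initial LR, followed by either small-LR fine-tuning or weight averaging), the following is a minimal PyTorch-style sketch. The function name `train_two_phase`, the LR values, and the epoch counts are illustrative assumptions, not the actual hyperparameters used in the study.

```python
import torch
import torch.optim as optim

def train_two_phase(model, loader, loss_fn,
                    large_lr=0.5, small_lr=0.01,
                    large_lr_epochs=50, finetune_epochs=10,
                    average_weights=False):
    """Sketch: pre-train with a large LR, then fine-tune with a small LR
    or average weights collected along the large-LR trajectory.
    All hyperparameter values are placeholders."""

    def run_epochs(optimizer, n_epochs, snapshots=None):
        for _ in range(n_epochs):
            for x, y in loader:
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
            if snapshots is not None:
                # keep a copy of the weights after each large-LR epoch
                snapshots.append({k: v.detach().clone()
                                  for k, v in model.state_dict().items()})

    # Phase 1: training with a large initial LR
    snapshots = [] if average_weights else None
    run_epochs(optim.SGD(model.parameters(), lr=large_lr),
               large_lr_epochs, snapshots)

    if average_weights:
        # Phase 2a: replace the weights with the average of the snapshots
        avg = {k: torch.stack([s[k].float() for s in snapshots]).mean(0)
                    .to(snapshots[0][k].dtype)
               for k in snapshots[0]}
        model.load_state_dict(avg)
    else:
        # Phase 2b: fine-tune from the large-LR solution with a small LR
        run_epochs(optim.SGD(model.parameters(), lr=small_lr),
                   finetune_epochs)
    return model
```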