Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters, such as the learning rate, exhibit transfer from small to very large models, thus reducing the cost of hyperparameter tuning. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is remarkably consistent across very different model sizes. In this work, we provide empirical evidence that learning rate transfer can be attributed to the fact that, under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian (i.e. the sharpness) is largely independent of the width and depth of the network for a sustained period of training. In contrast, we show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer. But what causes these differences in the sharpness dynamics? Through a connection between the spectra of the Hessian and the NTK matrix, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK regime) of feature learning, which results in a different evolution of the NTK, and thus of the sharpness. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText.
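To make the key quantities concrete: the sharpness is the largest eigenvalue $\lambda_{\max}(\nabla^2_\theta L)$ of the training loss Hessian, and its link to the NTK can be seen through the standard Gauss-Newton decomposition (sketched here for MSE loss; the paper's own argument may use a more general form). For $L(\theta) = \frac{1}{2n}\sum_{i}\big(f(x_i;\theta) - y_i\big)^2$,

$$\nabla^2_\theta L \;=\; \underbrace{\tfrac{1}{n}\, J^\top J}_{\text{Gauss-Newton term}} \;+\; \tfrac{1}{n}\sum_{i}\big(f(x_i;\theta) - y_i\big)\,\nabla^2_\theta f(x_i;\theta), \qquad J := \nabla_\theta f(X;\theta),$$

and the Gauss-Newton term shares its nonzero eigenvalues with the empirical NTK matrix $K = \tfrac{1}{n}\, J J^\top$, since $J^\top J$ and $J J^\top$ always have the same nonzero spectrum. This is why feature learning, which drives the evolution of $K$, also drives the evolution of the sharpness.

In practice, the sharpness is typically estimated by power iteration on Hessian-vector products rather than by forming the Hessian explicitly. The sketch below is a minimal illustration of that measurement, not the authors' code; `model`, `loss_fn`, and the batch `(x, y)` are hypothetical placeholders for the reader's own PyTorch setup.

```python
import torch

def estimate_sharpness(model, loss_fn, x, y, iters=20):
    """Estimate lambda_max of the training-loss Hessian (the sharpness)
    by power iteration on Hessian-vector products (Pearlmutter's trick)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    # First backward pass with create_graph=True so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit starting vector, stored as one tensor per parameter.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]

    lam = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate <grads, v> w.r.t. params.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient <v, Hv> converges to the top eigenvalue.
        lam = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return lam
```

Tracking such an estimate over training, across networks of different widths and depths under $\mu$P versus an NTK-style parameterization, is the kind of measurement the paper's sharpness claims rest on.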