We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parametrization designed to ``maximize'' feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a \emph{non-zero constant} as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as the Standard Parametrization (SP) and the Neural Tangent Parametrization (NTP). We give intuitive proofs and support the theoretical findings with extensive empirical results.
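As an illustration (not taken from the paper), the following NumPy sketch contrasts $\mu$P with SP on a depth-two linear network fit to a toy regression task; the task, widths, step count, and learning-rate grid are all hypothetical choices. It uses the abc form of $\mu$P ($\sqrt{n}$ multiplier on the input layer, $1/\sqrt{n}$ on the output layer, entries initialized $\mathcal{N}(0, 1/n)$); in this depth-two linear case the two multipliers cancel along both the forward and backward paths, so the code reduces to plain SGD on the raw weights and the parametrizations differ only in initialization scale.

\begin{verbatim}
import numpy as np

np.seterr(over="ignore", invalid="ignore")  # silence overflow warnings from diverging runs

# Hypothetical toy task: noiseless linear regression targets y = X @ beta.
rng = np.random.default_rng(0)
d, B, steps = 8, 256, 100
beta = rng.standard_normal((d, 1))
X = rng.standard_normal((B, d))
y = X @ beta

def train(n, lr, param):
    """Train a width-n depth-two linear MLP with full-batch SGD at a single
    global learning rate; return the final MSE (inf if training diverged)."""
    r = np.random.default_rng(1)  # same init seed across the whole sweep
    if param == "mup":
        # muP (abc form, multipliers canceled): both layers N(0, 1/n).
        w1 = r.standard_normal((n, d)) / np.sqrt(n)
        w2 = r.standard_normal((1, n)) / np.sqrt(n)
    else:
        # SP: fan_in initialization, no width-dependent multipliers.
        w1 = r.standard_normal((n, d)) / np.sqrt(d)
        w2 = r.standard_normal((1, n)) / np.sqrt(n)
    for _ in range(steps):
        h = X @ w1.T                   # (B, n) features
        pred = h @ w2.T                # (B, 1) network output
        if not np.isfinite(pred).all():
            return float("inf")
        g = 2.0 * (pred - y) / B       # dLoss/dpred for the MSE loss
        gw2 = g.T @ h                  # gradient w.r.t. w2
        gw1 = (g @ w2).T @ X           # gradient w.r.t. w1 (pre-update w2)
        w2 -= lr * gw2
        w1 -= lr * gw1
    loss = np.mean((X @ w1.T @ w2.T - y) ** 2)
    return float(loss) if np.isfinite(loss) else float("inf")

lrs = np.logspace(-2, 1, 10)
for param in ("mup", "sp"):
    print(param)
    for n in (64, 256, 1024, 4096):
        losses = [train(n, lr, param) for lr in lrs]
        print(f"  width {n:5d}: best lr ~ {lrs[int(np.argmin(losses))]:.3g}")
\end{verbatim}

Under the abstract's claims one would expect the best learning rate reported for $\mu$P to stay roughly constant across widths, while the best learning rate under SP shrinks as width grows; the exact numbers depend on the arbitrary task and grid above.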