In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($\mu$P) learning rate, defined as the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n$ and $L$. As in prior work on $\mu$P by Yang et al., we find that this maximal update learning rate is independent of $n$ for all but the first and last layer weights. However, we find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
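The quantity described above, the mean squared change in pre-activations after one gradient descent step, can be measured numerically. The following is only a minimal NumPy sketch under assumptions that are not taken from the note: hidden weights are drawn with variance $2/\text{fan-in}$, output weights with standard deviation $1/n$ as a stand-in for a mean-field scaling, the loss is a squared error on a single random input, and only the hidden-to-hidden weights are updated. All function names are hypothetical, and the sketch is not meant to reproduce the $L^{-3/2}$ scaling.

```python
import numpy as np


def init_params(n_in, n, L, rng):
    """Random ReLU MLP with L hidden layers of width n.

    Assumed scalings (not from the note): hidden weights ~ N(0, 2/fan_in),
    output weights ~ N(0, 1/n^2) as a stand-in for a mean-field scaling.
    """
    Ws = [rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n, n_in))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n)) for _ in range(L - 1)]
    w_out = rng.normal(0.0, 1.0 / n, size=(n,))
    return Ws, w_out


def forward(Ws, w_out, x):
    """Return the list of pre-activations of every layer and the scalar output."""
    zs, h = [], x
    for W in Ws:
        z = W @ h
        zs.append(z)
        h = np.maximum(z, 0.0)
    return zs, float(w_out @ h)


def hidden_grads(Ws, w_out, x, y):
    """Gradients of 0.5 * (f(x) - y)^2 with respect to each matrix in Ws."""
    zs, out = forward(Ws, w_out, x)
    hs = [x] + [np.maximum(z, 0.0) for z in zs]   # hs[l] is the input to layer l
    delta = (out - y) * w_out * (zs[-1] > 0)      # dLoss/dz for the last hidden layer
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(delta, hs[l])
        if l > 0:
            delta = (Ws[l].T @ delta) * (zs[l - 1] > 0)
    return grads


def mean_sq_preact_change(n, L, lr, n_in=8, seed=0):
    """Mean squared change of the last-layer pre-activations after one GD step.

    Only the hidden-to-hidden weights (all but the first matrix) are updated,
    and the output weights are held fixed; this is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    Ws, w_out = init_params(n_in, n, L, rng)
    x, y = rng.normal(size=n_in), 1.0
    zs_before, _ = forward(Ws, w_out, x)
    grads = hidden_grads(Ws, w_out, x, y)
    new_Ws = [Ws[0]] + [W - lr * g for W, g in zip(Ws[1:], grads[1:])]
    zs_after, _ = forward(new_Ws, w_out, x)
    return float(np.mean((zs_after[-1] - zs_before[-1]) ** 2))
```

Calling `mean_sq_preact_change(n, L, lr)` over a grid of depths and learning rates, averaged over several seeds, gives a rough picture of how large `lr` can be before the change stops being of order one. How closely such an experiment tracks the stated scaling depends on the parametrization, which is only assumed here.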