By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of the block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P, that extends $\mu$P, and we show empirically that it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that the absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (as in modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as on a Megatron transformer trained on Common Crawl.
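To make the notion of a depthwise block multiplier concrete, the following is a minimal NumPy sketch of a resnet with one layer per block, in which every residual branch is scaled by $a \cdot L^{-\alpha}$ for depth $L$. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exponent, so the default `alpha=0.5` is an assumed value, and the function names, the $1/\sqrt{\text{width}}$ initialization, and the centering of the branch output are all choices made only for this sketch. The learning-rate scaling is not modeled here.

```python
import numpy as np


def block_multiplier(depth, alpha=0.5, a=1.0):
    """Depth-dependent scale a * depth**(-alpha) applied to every residual branch.

    alpha=0.5 is an assumption of this sketch; the abstract does not state it.
    """
    return a * float(depth) ** (-alpha)


def init_weights(depth, width, seed=0):
    """One square weight matrix per block, entries ~ N(0, 1/width) (assumed init)."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]


def resnet_forward(x, weights, alpha=0.5, a=1.0, phi=np.abs):
    """Forward pass x_l = x_{l-1} + multiplier * centered(phi(W_l @ x_{l-1})).

    phi defaults to the absolute value, the homogeneous nonlinearity the
    abstract singles out. Centering the branch output (subtracting its mean
    over coordinates) is a choice of this sketch, so that the non-negative
    output of abs does not accumulate into a drift across blocks.
    """
    multiplier = block_multiplier(len(weights), alpha=alpha, a=a)
    for W in weights:
        branch = phi(W @ x)
        branch = branch - branch.mean()
        x = x + multiplier * branch
    return x


def rms(v):
    """Root-mean-square size of the coordinates of v."""
    return float(np.sqrt(np.mean(v ** 2)))
```

A usage example: draw `x = np.random.default_rng(1).standard_normal(256)`, build `weights = init_weights(depth, 256)` for several values of `depth`, and compare `rms(resnet_forward(x, weights, alpha=0.5))` with the same call at `alpha=0.0`. In this sketch the unscaled branches (`alpha=0.0`) let the activations grow rapidly with depth, while the depth-scaled multiplier keeps their size of order one at initialization.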