Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, to achieve *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of the block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P, that extends $\mu$P, and we show empirically that it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that the absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (as in modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as on a Megatron transformer trained on Common Crawl.
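To make the notion of a depthwise block multiplier concrete, here is a minimal sketch, not the authors' code. It assumes linear one-layer blocks with freshly sampled Gaussian weights and a branch multiplier of $1/\sqrt{L}$ for depth $L$; that multiplier is an assumption of this sketch, since the abstract itself does not state it. The function name `final_norm_ratio` and all of its parameters are hypothetical. The sketch covers only the forward pass at initialization and says nothing about learning rates or training.

```python
import numpy as np


def final_norm_ratio(depth, width=256, depth_scaled=True, seed=0):
    """Return ||x_L|| / ||x_0|| for a toy linear resnet at initialization.

    Each block computes x <- x + m * (W @ x), where W has i.i.d. N(0, 1/width)
    entries. The multiplier m is 1/sqrt(depth) when depth_scaled is True
    (an assumed depthwise scaling) and 1 otherwise.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    start = np.linalg.norm(x)
    multiplier = 1.0 / np.sqrt(depth) if depth_scaled else 1.0
    for _ in range(depth):
        # Fresh weights for every block, scaled so that ||W @ x|| ~ ||x||.
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + multiplier * (W @ x)
    return np.linalg.norm(x) / start


if __name__ == "__main__":
    for depth in (4, 16, 64):
        scaled = final_norm_ratio(depth)
        unscaled = final_norm_ratio(depth, depth_scaled=False)
        print(f"depth={depth:3d}  scaled={scaled:.3f}  unscaled={unscaled:.3e}")
```

In this toy setting the squared norm grows by roughly a factor of $(1 + 1/L)$ per block with the scaled multiplier, so the total growth stays bounded as $L$ increases, whereas the unscaled network roughly doubles its squared norm at every block.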
