Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function becomes a Gaussian process (GP), and a quantitatively predictive description is possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality. We introduce \emph{partial Jacobians} of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0\leq l$. We derive recurrence relations for the norms of partial Jacobians and use these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows one to select the optimal initialization for a broad class of deep neural networks containing fully connected, convolutional, and normalization layers. Using these tools we show quantitatively that proper stacking of LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze ResNet and MLP-Mixer architectures and demonstrate the everywhere-critical regime.
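As a rough illustration of what a numerical test based on partial Jacobians can look like, the following sketch measures the norm of the partial Jacobian of a bias-free fully connected ReLU network at random initialization. It is a minimal example under stated assumptions, not the implementation used in this work: the function name `partial_jacobian_norm`, the choice $l_0 = 0$, the random starting preactivations, and the normalization (squared Frobenius norm divided by the width, averaged over a few random draws) are all choices made here for illustration. Weights are drawn with variance $\sigma_w^2/N$, where $N$ is the layer width.

```python
import math
import random


def partial_jacobian_norm(sigma_w2, width=32, depth=8, n_samples=4, seed=0):
    """Average of ||d h^depth / d h^0||_F^2 / width over random ReLU MLPs.

    Hypothetical helper for illustration: no biases, no LayerNorm, no
    residual connections; weights have variance sigma_w2 / width.
    """
    total = 0.0
    for s in range(n_samples):
        rng = random.Random(seed + s)
        # Preactivations of the starting layer l_0 = 0 (random, for illustration).
        h = [rng.gauss(0.0, 1.0) for _ in range(width)]
        # jac holds d h^l / d h^0; for l = l_0 it is the identity matrix.
        jac = [[1.0 if i == j else 0.0 for j in range(width)] for i in range(width)]
        std = math.sqrt(sigma_w2 / width)
        for _ in range(depth):
            w = [[std * rng.gauss(0.0, 1.0) for _ in range(width)]
                 for _ in range(width)]
            # ReLU derivative evaluated at the current preactivations h^l.
            gate = [1.0 if x > 0.0 else 0.0 for x in h]
            # Columns of diag(gate) @ jac, prepared for the matrix product below.
            gated_cols = list(zip(*[[g * v for v in row]
                                    for g, row in zip(gate, jac)]))
            # Chain rule: d h^{l+1}/d h^0 = W diag(phi'(h^l)) d h^l/d h^0.
            jac = [[sum(a * b for a, b in zip(w_row, col)) for col in gated_cols]
                   for w_row in w]
            # Forward pass: h^{l+1} = W relu(h^l).
            act = [g * x for g, x in zip(gate, h)]
            h = [sum(a * b for a, b in zip(w_row, act)) for w_row in w]
        total += sum(v * v for row in jac for v in row) / width
    return total / n_samples


if __name__ == "__main__":
    for sigma_w2 in (1.0, 2.0, 4.0):
        print(sigma_w2, partial_jacobian_norm(sigma_w2))
```

For this particular architecture, $\sigma_w^2 = 2$ is the well-known critical value: the measured norm stays of order one as depth grows, whereas smaller values make it decay and larger values make it grow exponentially with depth. At the small width used here the estimate fluctuates noticeably, so it should be read as a qualitative diagnostic only.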