Recent analyses of neural networks with shaped activations (i.e., the activation function is scaled as the network size grows) have led to scaling limits described by differential equations. However, these results do not a priori tell us anything about "ordinary" unshaped networks, where the activation is unchanged as the network size grows. In this article, we find a similar differential-equation-based asymptotic characterization for two types of unshaped networks. First, we show that the following two architectures converge to the same infinite-depth-and-width limit at initialization: (i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch, where $d$ is the network depth; and (ii) a multilayer perceptron (MLP) with depth $d \ll$ width $n$ and shaped ReLU activation at rate $d^{-1/2}$. Second, for an unshaped MLP at initialization, we derive the first-order asymptotic correction to the layerwise correlation. In particular, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$. Together, these results provide a connection between shaped and unshaped network architectures and open up the possibility of studying the effect of normalization methods and how they connect with shaping activation functions.
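To make the rescaled quantity $q_t = \ell^2 (1 - \rho_\ell)$ concrete, the following is a minimal NumPy sketch that tracks the correlation between two inputs through an unshaped ReLU MLP at initialization. Several choices here are assumptions for illustration and are not taken from the abstract: $\rho_\ell$ is measured as the cosine similarity of the pre-activations, the weights are i.i.d. Gaussian with a $\sqrt{2/n}$ (He-style) scaling, and the function names, width, depth, and inputs are hypothetical.

```python
import numpy as np


def relu(z):
    """Unshaped ReLU: the activation does not change with network size."""
    return np.maximum(z, 0.0)


def cosine(u, v):
    """Cosine similarity, used here as the layerwise correlation (assumption)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def relu_mlp_correlations(x_a, x_b, width, depth, seed=0):
    """Return rho_1..rho_depth for two inputs passed through one random MLP.

    Assumed setup: Gaussian weights, sqrt(2/width) scaling on hidden layers,
    correlation measured on pre-activations.
    """
    rng = np.random.default_rng(seed)
    n_in = x_a.shape[0]
    # Input layer: a linear map shared by both inputs.
    w_in = rng.standard_normal((width, n_in)) / np.sqrt(n_in)
    z_a, z_b = w_in @ x_a, w_in @ x_b
    rhos = []
    for _ in range(depth):
        rhos.append(cosine(z_a, z_b))
        # The same random weight matrix is applied to both inputs.
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        z_a, z_b = w @ relu(z_a), w @ relu(z_b)
    return np.array(rhos)


# Hypothetical sizes; the abstract's regime has t = ell / n held fixed.
width, depth = 256, 50
x_a = np.array([1.0, 0.0])
x_b = np.array([0.0, 1.0])

rhos = relu_mlp_correlations(x_a, x_b, width, depth)
layers = np.arange(1, depth + 1)
t = layers / width               # time variable t = ell / n
q = layers**2 * (1.0 - rhos)     # rescaled quantity q_t = ell^2 (1 - rho_ell)
```

Plotting `q` against `t` over several seeds gives a rough picture of the random fluctuations that the SDE limit is meant to describe; a single finite-size run like this one is only illustrative and does not verify the limit.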