The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
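The conditional Gaussianity of the logits can be illustrated with a minimal NumPy sketch. The abstract does not specify the architecture or the shaping, so everything concrete here is an assumption made for illustration: a fully connected network with i.i.d. standard Gaussian weights, a leaky-ReLU-type activation whose slopes are `1 + c / sqrt(n)` for width `n`, a variance-preserving normalization constant, and the hypothetical names `shaped_relu` and `penultimate_covariance`. Given the penultimate features, each output unit's logits across the inputs are jointly Gaussian with covariance equal to the Gram matrix `V` returned below, so the randomness of `V` over initializations is the object of study.

```python
import numpy as np


def shaped_relu(x, n, c_plus=0.0, c_minus=-1.0):
    """Leaky-ReLU-type activation that approaches the identity as n grows.

    Assumption: slopes are 1 + c / sqrt(n); c_plus and c_minus are
    illustrative values, not taken from the abstract.
    """
    s_plus = 1.0 + c_plus / np.sqrt(n)
    s_minus = 1.0 + c_minus / np.sqrt(n)
    return s_plus * np.maximum(x, 0.0) + s_minus * np.minimum(x, 0.0)


def penultimate_covariance(x, n, depth, rng, c_plus=0.0, c_minus=-1.0):
    """Sample one Gram matrix of penultimate-layer features at initialization.

    x has shape (n_in, m): m inputs of dimension n_in.
    Returns V of shape (m, m). Conditional on V, the logits of each output
    unit (with i.i.d. N(0, 1) output weights scaled by 1 / sqrt(n)) are
    Gaussian with covariance V.
    """
    n_in, _ = x.shape
    # Assumed normalization: 1 / E[phi(g)^2] for g ~ N(0, 1), which keeps
    # the expected squared norm constant from layer to layer.
    s_plus = 1.0 + c_plus / np.sqrt(n)
    s_minus = 1.0 + c_minus / np.sqrt(n)
    norm = 2.0 / (s_plus**2 + s_minus**2)

    h = rng.standard_normal((n, n_in)) @ x / np.sqrt(n_in)
    for _ in range(depth):
        w = rng.standard_normal((n, n))
        h = np.sqrt(norm / n) * (w @ shaped_relu(h, n, c_plus, c_minus))
    phi = shaped_relu(h, n, c_plus, c_minus)
    return norm * (phi.T @ phi) / n


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2))  # two inputs of dimension 8
# Depth proportional to width, as in a joint depth-and-width regime.
samples = [penultimate_covariance(x, n=32, depth=16, rng=rng) for _ in range(20)]
V = samples[0]
```

Repeating the call over many seeds gives an empirical distribution of `V` at finite width and depth, which is the kind of quantity the simulations mentioned above compare against the SDE.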