Wide neural networks: From non-gaussian random fields at initialization to the NTK geometry of training

Recent developments in applications of artificial neural networks with over $n=10^{14}$ parameters make it extremely important to study the large-$n$ behavior of such networks. Most works on wide neural networks have focused on the infinite-width limit $n \to +\infty$ and have shown that, at initialization, such networks correspond to Gaussian processes. In this work we study their behavior for large but finite $n$. Our main contributions are the following: (1) We compute the corrections to Gaussianity as an asymptotic series in $n^{-\frac{1}{2}}$. The coefficients of this expansion are determined by the statistics of the parameter initialization and by the activation function. (2) We control the evolution of the outputs of finite-width networks during training by computing their deviations from the limiting infinite-width case, in which the network evolves through a linear flow. This improves previous estimates and yields sharper decay rates, in terms of $n$, for the finite-width NTK, valid during the entire training procedure. As a corollary, we also prove that, with arbitrarily high probability, the training of sufficiently wide neural networks converges to a global minimum of the corresponding quadratic loss function. (3) We estimate, in terms of $n$, how the deviations from Gaussianity evolve with training. In particular, using a certain metric on the space of measures, we find that, along training, the resulting measure is within $n^{-\frac{1}{2}}(\log n)^{1+}$ of the time-dependent Gaussian process corresponding to the infinite-width network, which is given explicitly by precomposing the initial Gaussian process with the linear flow corresponding to training in the infinite-width limit.
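As a rough numerical illustration of the statement that wide networks are only approximately Gaussian at initialization, the sketch below draws the output of a toy one-hidden-layer network many times and reports its excess kurtosis, which is zero for an exactly Gaussian output. Everything in it is an assumption made for illustration and is not taken from the paper: the architecture $f(x) = n^{-\frac{1}{2}} \sum_i v_i \tanh(w_i x)$, the i.i.d. standard normal weights, the single input $x = 1$, and the helper names `sample_output` and `excess_kurtosis_at_init`. It does not reproduce the asymptotic expansion itself; it only shows one non-Gaussian statistic shrinking as the width grows.

```python
import math
import random


def sample_output(width, x, rng):
    """One draw of f(x) = width**-0.5 * sum_i v_i * tanh(w_i * x).

    Assumption for illustration: all weights are i.i.d. N(0, 1).
    """
    total = 0.0
    for _ in range(width):
        w = rng.gauss(0.0, 1.0)  # hidden-layer weight
        v = rng.gauss(0.0, 1.0)  # output-layer weight
        total += v * math.tanh(w * x)
    return total / math.sqrt(width)


def excess_kurtosis_at_init(width, num_samples=5000, x=1.0, seed=0):
    """Monte Carlo estimate of the excess kurtosis of f(x) over random initializations."""
    rng = random.Random(seed)
    outputs = [sample_output(width, x, rng) for _ in range(num_samples)]
    mean = sum(outputs) / num_samples
    m2 = sum((y - mean) ** 2 for y in outputs) / num_samples  # second central moment
    m4 = sum((y - mean) ** 4 for y in outputs) / num_samples  # fourth central moment
    return m4 / m2 ** 2 - 3.0  # equals 0 for a Gaussian


if __name__ == "__main__":
    # Print the estimate for a few widths; larger widths should be closer to 0.
    for width in (1, 4, 16, 64):
        print(width, round(excess_kurtosis_at_init(width), 3))
```

In this toy model the output is a normalized sum of $n$ independent, identically distributed terms, so its exact excess kurtosis is that of a single term divided by $n$. The printed estimates should therefore fall toward zero as the width increases, up to Monte Carlo noise.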

