Recent developments in applications of artificial neural networks with over $n=10^{14}$ parameters make it important to understand the large-$n$ behavior of such networks. Most work on wide neural networks has focused on the infinite-width limit $n \to +\infty$ and has shown that, at initialization, such networks correspond to Gaussian processes. In this work we study their behavior for large but finite $n$. Our main contributions are the following: (1) We compute the corrections to Gaussianity as an asymptotic series in $n^{-\frac{1}{2}}$. The coefficients in this expansion are determined by the statistics of the parameter initialization and by the activation function. (2) We control the evolution of the outputs of finite-width networks during training by computing their deviations from the limiting infinite-width case, in which the network evolves through a linear flow. This improves previous estimates and yields sharper decay rates for the finite-width neural tangent kernel (NTK) in terms of $n$, valid during the entire training procedure. As a corollary, we also prove that, with arbitrarily high probability, the training of sufficiently wide neural networks converges to a global minimum of the corresponding quadratic loss function. (3) We estimate, in terms of $n$, how the deviations from Gaussianity evolve during training. In particular, using a certain metric on the space of measures, we find that throughout training the resulting measure is within $n^{-\frac{1}{2}}(\log n)^{1+}$ of the time-dependent Gaussian process corresponding to the infinite-width network, which is given explicitly by precomposing the initial Gaussian process with the linear flow that describes training in the infinite-width limit.
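As a rough numerical illustration of finite-width corrections to Gaussianity at initialization, the following sketch samples the output of a toy one-hidden-layer network at a single input and estimates its excess kurtosis for several widths. The architecture, the `tanh` activation, the standard normal initialization, the $1/\sqrt{n}$ output scaling, and all function names are assumptions made for this example only and are not taken from the paper. In this toy setting the output is a normalized sum of $n$ i.i.d. terms, so its excess kurtosis equals the single-term value divided by $n$ and vanishes as $n \to +\infty$.

```python
import numpy as np


def output_samples(width, num_networks, x=1.0, seed=0):
    """Sample f(x) = n^{-1/2} * sum_i a_i * tanh(w_i * x) over random initializations.

    Toy model (an assumption of this sketch): scalar input, one hidden layer
    of the given width, and i.i.d. standard normal weights a_i and w_i.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((num_networks, width))  # input-to-hidden weights
    a = rng.standard_normal((num_networks, width))  # hidden-to-output weights
    return (a * np.tanh(w * x)).sum(axis=1) / np.sqrt(width)


def excess_kurtosis(samples):
    """Empirical excess kurtosis; it is zero for an exactly Gaussian variable."""
    centered = samples - samples.mean()
    variance = np.mean(centered**2)
    return np.mean(centered**4) / variance**2 - 3.0


def kurtosis_by_width(widths=(1, 4, 16), num_networks=100_000):
    """Estimate the excess kurtosis of the network output for each width."""
    return {
        n: excess_kurtosis(output_samples(n, num_networks, seed=n))
        for n in widths
    }


if __name__ == "__main__":
    for n, kappa in kurtosis_by_width().items():
        print(f"width n = {n:3d}: excess kurtosis ~ {kappa:.3f}")
```

The estimates are Monte Carlo values and fluctuate with the seed and the number of sampled networks, so only the decreasing trend with $n$ should be read from them. Because the toy output distribution is symmetric, the leading non-Gaussian cumulant here is of order $n^{-1}$; this does not reproduce the general expansion in $n^{-\frac{1}{2}}$ described above.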