This article studies the infinite-width limit of deep feedforward neural networks whose weights are dependent, and modelled via a mixture of Gaussian distributions. Each hidden node of the network is assigned a non-negative random variable that controls the variance of the outgoing weights of that node. We make minimal assumptions on these per-node random variables: they are iid and their sum, in each layer, converges to some finite random variable in the infinite-width limit. Under this model, we show that each layer of the infinite-width neural network can be characterised by two simple quantities: a non-negative scalar parameter and a Lévy measure on the positive reals. If the scalar parameters are strictly positive and the Lévy measures are trivial at all hidden layers, then one recovers the classical Gaussian process (GP) limit, obtained with iid Gaussian weights. More interestingly, if the Lévy measure of at least one layer is non-trivial, we obtain a mixture of Gaussian processes (MoGP) in the large-width limit. The behaviour of the neural network in this regime is very different from the GP regime. One obtains correlated outputs, with non-Gaussian distributions, possibly with heavy tails. Additionally, we show that, in this regime, the weights are compressible, and some nodes have asymptotically non-negligible contributions, therefore representing important hidden features. Many sparsity-promoting neural network models can be recast as special cases of our approach, and we discuss their infinite-width limits; we also present an asymptotic analysis of the pruning error. We illustrate some of the benefits of the MoGP regime over the GP regime in terms of representation learning and compressibility on simulated, MNIST and Fashion MNIST datasets.
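The weight model described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the constant variances `sigma2 / p` used for the GP-like case, and the choice of iid Gamma(`alpha / p`, 1) per-node variances (whose sum over a layer of width `p` is Gamma(`alpha`, 1) distributed, hence finite at any width) are all assumptions made for illustration only.

```python
import numpy as np


def sample_layer_weights(rng, node_variances, n_out):
    """Draw W[j, k] ~ N(0, node_variances[j]): all weights leaving
    hidden node j share the same (random) variance."""
    std = np.sqrt(node_variances)[:, None]
    return std * rng.standard_normal((node_variances.shape[0], n_out))


def constant_variances(p, sigma2=1.0):
    """Illustrative GP-like case (assumption): every node gets variance
    sigma2 / p, so the weights are iid Gaussian."""
    return np.full(p, sigma2 / p)


def gamma_variances(rng, p, alpha=1.0):
    """Illustrative non-trivial case (assumption): iid Gamma(alpha / p, 1)
    variances, whose sum is Gamma(alpha, 1) for every width p."""
    return rng.gamma(shape=alpha / p, scale=1.0, size=p)


def top_k_share(node_variances, k):
    """Fraction of the total variance carried by the k largest nodes."""
    v = np.sort(node_variances)[::-1]
    return v[:k].sum() / v.sum()


def layer_output(rng, hidden, node_variances, n_out):
    """Pre-activations of the next layer given hidden activations of
    shape (n_inputs, p) and freshly sampled outgoing weights."""
    weights = sample_layer_weights(rng, node_variances, n_out)
    return hidden @ weights
```

Comparing `top_k_share(constant_variances(p), k)` with `top_k_share(gamma_variances(rng, p), k)` for a large `p` and a small `k` gives a rough feel for the compressibility claim: with constant variances the share is exactly `k / p`, whereas with the Gamma draws a handful of nodes typically carries most of the total variance.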