Characterizing how neural network depth, width, and dataset size jointly affect model quality is a central problem in deep learning theory. Here we give a complete solution in the special case of linear networks with output dimension one, trained using zero-noise Bayesian inference with Gaussian weight priors and mean squared error as the negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find non-asymptotic expressions for the predictive posterior and the Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Novel asymptotic expansions of these Meijer-G functions reveal a rich new picture of the joint role of depth, width, and dataset size. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors coincides with that of shallow networks with evidence-maximizing, data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, the Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is a novel emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this quantity determines the structure of the posterior in the large-data limit.
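To make the effective-depth ratio concrete, the following is a minimal sketch of the quantity as described above: hidden layers times data points, divided by width. The function name, the argument names, and the simplification that all hidden layers share a single width are assumptions made for illustration; they are not notation taken from the text.

```python
def effective_depth(num_hidden_layers: int, num_data_points: int, width: int) -> float:
    """Hidden layers times data points divided by width.

    Assumption: every hidden layer has the same width, so a single
    integer stands in for "the network width".
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return num_hidden_layers * num_data_points / width


# Scaling depth and width by the same factor leaves the ratio unchanged.
shallow = effective_depth(num_hidden_layers=2, num_data_points=1000, width=4000)
deeper = effective_depth(num_hidden_layers=4, num_data_points=1000, width=8000)
print(shallow, deeper)
```

Under this reading, two architectures with the same ratio on the same dataset are the ones the large-data statement above would treat alike, while adding layers at fixed width and fixed dataset size increases the ratio.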