The successes of modern deep machine learning methods are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, standard theoretical approaches involving infinite-width limits (formally, neural network Gaussian processes, or NNGPs) eliminate representation learning. We therefore develop a new infinite-width limit, the Bayesian representation learning limit, that exhibits representation learning mirroring that in finite-width models, yet retains some of the simplicity of standard infinite-width limits. In particular, we show that deep Gaussian processes (DGPs) in the Bayesian representation learning limit have exactly multivariate Gaussian posteriors, and that the posterior covariances can be obtained by optimizing an interpretable objective that combines a log-likelihood, which improves performance, with a series of KL-divergences, which keep the posteriors close to the prior. We confirm these results experimentally in wide but finite DGPs. Next, we introduce the possibility of using this limit and objective as a flexible, deep generalisation of kernel methods, which we call deep kernel machines (DKMs). Like most naive kernel methods, DKMs scale cubically in the number of datapoints. We therefore use methods from the Gaussian process inducing point literature to develop a sparse DKM that scales linearly in the number of datapoints. Finally, we extend these approaches to neural networks (NNs), which have non-Gaussian posteriors, in the Appendices.
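To make the shape of such an objective concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a Gaussian likelihood on the outputs, a squared-exponential kernel computed from the previous layer's Gram matrix, and per-layer weights `nus` on the KL terms. The function names, the kernel choice, the noise variance, and the weights are all illustrative assumptions; the exact objective is defined in the body of the paper.

```python
import numpy as np

def rbf_from_gram(G):
    """Squared-exponential kernel from a Gram matrix (assumed kernel choice).

    Uses the identity ||x_i - x_j||^2 = G_ii - 2 G_ij + G_jj.
    """
    d = np.diag(G)
    sq_dist = d[:, None] - 2.0 * G + d[None, :]
    return np.exp(-0.5 * sq_dist)

def kl_zero_mean_gaussians(A, B):
    """KL( N(0, A) || N(0, B) ) for positive-definite P x P matrices."""
    P = A.shape[0]
    B_inv_A = np.linalg.solve(B, A)
    _, logdet = np.linalg.slogdet(B_inv_A)
    return 0.5 * (np.trace(B_inv_A) - P - logdet)

def gaussian_log_marginal(Y, K, noise_var):
    """Sum over output columns of log N(y; 0, K + noise_var * I)."""
    P, n_out = Y.shape
    chol = np.linalg.cholesky(K + noise_var * np.eye(P))
    alpha = np.linalg.solve(chol, Y)               # chol^{-1} Y
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))   # log det of the covariance
    quad = np.sum(alpha ** 2)                      # sum of y^T C^{-1} y
    return -0.5 * (quad + n_out * logdet + n_out * P * np.log(2.0 * np.pi))

def toy_objective(G0, grams, Y, noise_var=0.1, nus=None):
    """Log-likelihood on the last layer minus weighted KL terms per layer.

    G0    : Gram matrix of the inputs.
    grams : candidate Gram matrices G_1..G_L (the quantities being optimised).
    nus   : hypothetical per-layer weights on the KL terms (default 1.0).
    """
    nus = [1.0] * len(grams) if nus is None else nus
    prev, kl_total = G0, 0.0
    for nu, G in zip(nus, grams):
        # Each layer's prior covariance is the kernel of the previous Gram.
        kl_total += nu * kl_zero_mean_gaussians(G, rbf_from_gram(prev))
        prev = G
    return gaussian_log_marginal(Y, rbf_from_gram(prev), noise_var) - kl_total
```

When every Gram matrix equals its prior kernel, all KL terms vanish and the sketch reduces to an ordinary Gaussian process log marginal likelihood with a fixed kernel, which is the regime without representation learning. Note that the Cholesky factorisation and linear solves on P x P matrices are what give the cubic cost in the number of datapoints mentioned above.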