A theory of representation learning gives a deep generalisation of kernel methods

The successes of modern deep machine learning methods are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, standard theoretical approaches involving infinite-width limits (formally, neural network Gaussian processes, or NNGPs) eliminate representation learning. We therefore develop a new infinite-width limit, the Bayesian representation learning limit, which exhibits representation learning mirroring that in finite-width models while retaining some of the simplicity of standard infinite-width limits. In particular, we show that deep Gaussian processes (DGPs) in the Bayesian representation learning limit have exactly multivariate Gaussian posteriors, and that the posterior covariances can be obtained by optimising an interpretable objective combining a log-likelihood, which improves performance, with a series of KL-divergences, which keep the posteriors close to the prior. We confirm these results experimentally in wide but finite DGPs. Next, we introduce the possibility of using this limit and objective as a flexible, deep generalisation of kernel methods, which we call deep kernel machines (DKMs). Like most naive kernel methods, DKMs scale cubically in the number of datapoints. We therefore use methods from the Gaussian process inducing-point literature to develop a sparse DKM that scales linearly in the number of datapoints. Finally, we extend these approaches to neural networks (NNs), which have non-Gaussian posteriors, in the Appendices.
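As a rough illustration of the kind of objective described above, the following sketch combines a Gaussian log-likelihood with a KL-divergence that keeps a learned covariance close to a prior kernel. It is a single-layer toy under stated assumptions, not the objective derived in the paper: the squared-exponential kernel, the Gaussian regression likelihood, and the names `rbf_kernel`, `kl_zero_mean_gaussians`, `gaussian_log_marginal`, `toy_objective`, `noise_var` and `kl_weight` are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel matrix over the rows of X (illustrative choice)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def kl_zero_mean_gaussians(A, B):
    """KL( N(0, A) || N(0, B) ) for positive-definite matrices A and B."""
    P = A.shape[0]
    trace_term = np.trace(np.linalg.solve(B, A))
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_B = np.linalg.slogdet(B)
    return 0.5 * (trace_term - P + logdet_B - logdet_A)


def gaussian_log_marginal(y, K, noise_var):
    """log N(y; 0, K + noise_var * I), computed via a Cholesky factor.

    The Cholesky factorisation is the step that costs O(P^3) in the
    number of datapoints P, as in naive kernel methods.
    """
    P = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(P))
    alpha = np.linalg.solve(L, y)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * P * np.log(2.0 * np.pi))


def toy_objective(G, K_prior, y, noise_var=0.1, kl_weight=1.0):
    """Assumed toy form: log-likelihood under G minus a weighted KL to the prior."""
    fit = gaussian_log_marginal(y, G, noise_var)
    penalty = kl_zero_mean_gaussians(G, K_prior)
    return fit - kl_weight * penalty
```

In this toy, setting `G` equal to `K_prior` makes the KL term vanish, so the objective reduces to the ordinary Gaussian process log marginal likelihood; moving `G` away from the prior is only worthwhile when it improves the fit by more than the KL penalty.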
