Deep kernel processes are a recently introduced class of deep Bayesian models that have the flexibility of neural networks but work entirely with Gram matrices. They operate by alternately sampling a Gram matrix from a distribution over positive semi-definite matrices and applying a deterministic transformation. When the distribution is chosen to be Wishart, the model is called a deep Wishart process (DWP). This model is of particular interest because its prior is equivalent to a deep Gaussian process (DGP) prior, yet it is invariant to rotational symmetries, which leads to a simpler posterior distribution. Practical inference in the DWP was made possible by recent work (Ober and Aitchison, 2021a, "A variational approximate posterior for the deep Wishart process"), in which the authors used a generalisation of the Bartlett decomposition of the Wishart distribution as the variational approximate posterior. However, predictive performance in that paper was less impressive than one might expect, with the DWP outperforming a DGP on only a few of the UCI datasets used for comparison. In this paper, we show that further generalising their distribution to allow linear combinations of rows and columns in the Bartlett decomposition results in better predictive performance, while incurring negligible additional computational cost.
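To make the alternating structure concrete, the following is a minimal sketch of drawing one sample from a DWP prior, with each Wishart draw generated through the standard Bartlett decomposition. It illustrates the prior only, not the variational approximate posterior discussed above. The function names, the squared-exponential kernel computed from a Gram matrix, the jitter term, and the layer widths are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def sample_wishart_bartlett(scale, nu, rng):
    """Draw W ~ Wishart(scale, nu) via the Bartlett decomposition.

    Requires nu >= P, where P is the size of `scale`.
    """
    p = scale.shape[0]
    chol = np.linalg.cholesky(scale)  # scale = chol @ chol.T
    # Strictly lower-triangular entries are standard normal.
    a = np.tril(rng.standard_normal((p, p)), k=-1)
    # Squared diagonal entries are chi-squared with nu - j degrees of
    # freedom (j = 0, ..., P-1 with zero-based indexing).
    a[np.diag_indices(p)] = np.sqrt(rng.chisquare(nu - np.arange(p)))
    la = chol @ a
    return la @ la.T


def se_kernel_from_gram(gram, jitter=1e-6):
    """Squared-exponential kernel computed from a Gram matrix alone.

    Assumed kernel choice: squared distances are recovered as
    G_ii - 2 G_ij + G_jj. The jitter keeps the Cholesky factor stable.
    """
    d = np.diag(gram)
    sq_dist = d[:, None] - 2.0 * gram + d[None, :]
    return np.exp(-0.5 * sq_dist) + jitter * np.eye(gram.shape[0])


def sample_dwp_prior(x, widths, rng):
    """Alternate a deterministic kernel transform with a Wishart draw."""
    gram = x @ x.T / x.shape[1]  # input Gram matrix
    for nu in widths:
        k = se_kernel_from_gram(gram)
        # Scaling by 1/nu keeps the expected Gram matrix equal to k.
        gram = sample_wishart_bartlett(k / nu, nu, rng)
    return gram
```

For example, `sample_dwp_prior(x, [8, 8], np.random.default_rng(0))` with `x` of shape `(5, 3)` returns a 5 × 5 symmetric positive semi-definite Gram matrix. Each entry of `widths` must be at least the number of data points for this form of the Bartlett decomposition to apply.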