Gaussian processes are widely used as priors for unknown functions in statistics and machine learning. To achieve computationally feasible inference for large datasets, a popular approach is the Vecchia approximation, which is an ordered conditional approximation of the data vector that implies a sparse Cholesky factor of the precision matrix. The ordering and sparsity pattern are typically determined based on the Euclidean distance between the inputs or locations corresponding to the data points. Here, we propose instead to use a correlation-based distance metric, which implicitly applies the Vecchia approximation in a suitably transformed input space. The correlation-based algorithm can be carried out in quasilinear time in the size of the dataset, and so it can be applied even for iterative inference on unknown parameters in the correlation structure. The correlation-based approach has two advantages for complex settings: it can result in more accurate approximations, and it offers a simple, automatic strategy that can be applied to any covariance function, even when Euclidean distance is not applicable. We demonstrate these advantages in several settings, including anisotropic, nonstationary, multivariate, and spatio-temporal processes. We also illustrate our method on multivariate spatio-temporal temperature fields produced by a regional climate model.
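To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a zero-mean process, takes the correlation-based distance to be d_ij = sqrt(1 - |rho_ij|) (one possible choice, assumed here), and uses a greedy maximin ordering (also an assumption, since the text above does not name an ordering). All function names are hypothetical, and the brute-force loops cost O(n^2) or more, so this illustrates the construction rather than the quasilinear algorithm.

```python
import numpy as np

def aniso_exp_cov(X, scales):
    """Exponential covariance with one length scale per input dimension."""
    Z = X / scales
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.exp(-D)

def correlation_distance(K):
    """Assumed correlation-based distance: d_ij = sqrt(1 - |rho_ij|)."""
    s = np.sqrt(np.diag(K))
    rho = K / np.outer(s, s)
    return np.sqrt(np.clip(1.0 - np.abs(rho), 0.0, None))

def maximin_order(D):
    """Greedy maximin ordering (brute force): each new point maximizes
    its minimum distance to the points already selected."""
    n = D.shape[0]
    order = [int(np.argmin(D.sum(1)))]  # start from the most central point
    mind = D[order[0]].copy()
    mind[order[0]] = -np.inf            # never reselect a chosen point
    for _ in range(n - 1):
        j = int(np.argmax(mind))
        order.append(j)
        mind = np.minimum(mind, D[j])
        mind[j] = -np.inf
    return np.array(order)

def conditioning_sets(D, order, m):
    """For each ordered point, the (up to) m nearest previously ordered
    points under the distance matrix D."""
    sets = []
    for i, idx in enumerate(order):
        prev = order[:i]
        sets.append(prev[np.argsort(D[idx, prev])[:m]])
    return sets

def vecchia_loglik(y, K, order, sets):
    """Vecchia log-likelihood: sum of univariate Gaussian conditional
    log-densities, each conditioning only on its small set."""
    ll = 0.0
    for idx, c in zip(order, sets):
        mean, var = 0.0, K[idx, idx]
        if len(c) > 0:
            w = np.linalg.solve(K[np.ix_(c, c)], K[c, idx])
            mean = w @ y[c]
            var = var - w @ K[c, idx]
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[idx] - mean) ** 2 / var)
    return ll

def exact_loglik(y, K):
    """Exact zero-mean Gaussian log-likelihood via a dense Cholesky factor."""
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, y)
    return -0.5 * (a @ a) - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)
```

Because the distance is computed from the covariance matrix itself, the same code applies unchanged to anisotropic or other covariances for which Euclidean distance between inputs is a poor guide. Swapping `correlation_distance(K)` for a Euclidean distance matrix in `maximin_order` and `conditioning_sets` gives the conventional variant for comparison. With `m` equal to `n - 1` the approximation reduces to the exact likelihood, which provides a simple sanity check.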