Gaussian processes (GPs) are flexible, probabilistic, non-parametric models widely used in fields such as spatial statistics, time series analysis, and machine learning. A drawback of Gaussian processes is their computational cost: $\mathcal{O}(N^3)$ time and $\mathcal{O}(N^2)$ memory complexity for $N$ observations, which makes them prohibitive for large datasets. Numerous approximation techniques have been proposed to address this limitation. In this work, we systematically compare the accuracy of different Gaussian process approximations with respect to marginal likelihood evaluation, parameter estimation, and prediction, taking into account the time required to achieve a given accuracy. We analyze this trade-off between accuracy and runtime on multiple simulated and large-scale real-world datasets and find that Vecchia approximations are the most accurate in almost all experiments. However, for certain real-world datasets, low-rank inducing point-based methods, i.e., full-scale and modified predictive process approximations, can provide more accurate predictive distributions for extrapolation.
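To make the stated cost concrete, the following is a minimal sketch of an exact (non-approximated) GP log marginal likelihood evaluation. It is an illustration only, not the implementation used in this work: the squared exponential kernel, the function names `sq_exp_kernel` and `exact_gp_log_marginal_likelihood`, and the default parameter values are all assumptions chosen for the example. The dense $N \times N$ covariance matrix accounts for the $\mathcal{O}(N^2)$ memory, and its Cholesky factorization for the $\mathcal{O}(N^3)$ time.

```python
import numpy as np


def sq_exp_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance matrix (an assumed kernel choice)."""
    sq_norms = np.sum(X**2, axis=1)
    # Pairwise squared distances; clip tiny negatives caused by round-off.
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)


def exact_gp_log_marginal_likelihood(X, y, lengthscale=1.0, variance=1.0, noise=0.1):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise variance `noise`."""
    n = X.shape[0]
    # Dense N x N covariance matrix: O(N^2) memory.
    K = sq_exp_kernel(X, lengthscale, variance) + noise * np.eye(n)
    # Cholesky factorization K = L L^T: O(N^3) time, the dominant cost.
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} y via two solves with the Cholesky factor
    # (generic solves here; dedicated triangular solves would be faster).
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log det K = 2 * sum(log diag(L)), so 0.5 * log det K = sum(log diag(L)).
    return (
        -0.5 * y @ alpha
        - np.sum(np.log(np.diag(L)))
        - 0.5 * n * np.log(2.0 * np.pi)
    )
```

Every evaluation of this function during parameter estimation repeats the cubic-cost factorization, which is the step that approximations such as Vecchia and inducing point methods are designed to avoid.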