Gaussian processes are flexible, probabilistic, non-parametric models widely used in machine learning and statistics. However, their computational cost limits their scalability to large data sets. To address this limitation, we propose Vecchia-inducing-points full-scale (VIF) approximations, which combine the strengths of global inducing points and local Vecchia approximations. Vecchia approximations excel in settings with low-dimensional inputs and moderately smooth covariance functions, while inducing point methods are better suited to high-dimensional inputs and smoother covariance functions. Our VIF approach bridges these two regimes by using an efficient correlation-based neighbor-finding strategy for the Vecchia approximation of the residual process, implemented via a modified cover tree algorithm. We further extend our framework to non-Gaussian likelihoods by introducing iterative methods that, when using a Laplace approximation, reduce the computational cost of training and prediction by several orders of magnitude compared to Cholesky-based computations. In particular, we propose and compare novel preconditioners and provide theoretical convergence results. Extensive numerical experiments on simulated and real-world data sets show that VIF approximations are both computationally efficient and more accurate and numerically stable than state-of-the-art alternatives. All methods are implemented in the open-source C++ library GPBoost with high-level Python and R interfaces.
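To make the combination of inducing points and a Vecchia approximation of the residual process more concrete, the following is a minimal NumPy sketch, not the GPBoost implementation. It assumes an exponential kernel, a small nugget added for numerical stability, and a brute-force O(n^2) neighbor search in place of the modified cover tree; all function names and parameter values are hypothetical and chosen for illustration only.

```python
import numpy as np


def exp_kernel(A, B, lengthscale=0.5, variance=1.0):
    """Exponential (Matern 1/2) kernel; the kernel choice is an assumption."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-np.sqrt(np.maximum(d2, 0.0)) / lengthscale)


def residual_covariance(X, Z, nugget=1e-2, jitter=1e-10):
    """Split the covariance into a low-rank inducing-point part and a residual."""
    K_full = exp_kernel(X, X) + nugget * np.eye(len(X))
    K_xz = exp_kernel(X, Z)
    K_zz = exp_kernel(Z, Z) + jitter * np.eye(len(Z))
    low_rank = K_xz @ np.linalg.solve(K_zz, K_xz.T)
    return K_full, low_rank, K_full - low_rank


def correlation_neighbors(R, num_neighbors):
    """For each point i, pick the previously ordered points j < i with the
    largest absolute residual correlation (brute force, not a cover tree)."""
    sd = np.sqrt(np.diag(R))
    corr = np.abs(R / np.outer(sd, sd))
    neighbors = []
    for i in range(R.shape[0]):
        order = np.argsort(-corr[i, :i])
        neighbors.append(np.sort(order[:num_neighbors]))
    return neighbors


def vecchia_factors(R, neighbors):
    """Return unit lower-triangular B and diagonal D with R^{-1} ~ B^T D^{-1} B."""
    n = R.shape[0]
    B = np.eye(n)
    D = np.empty(n)
    for i, nb in enumerate(neighbors):
        if len(nb) == 0:
            D[i] = R[i, i]
            continue
        a = np.linalg.solve(R[np.ix_(nb, nb)], R[nb, i])
        B[i, nb] = -a                    # regression weights on the neighbors
        D[i] = R[i, i] - R[nb, i] @ a    # conditional variance given neighbors
    return B, D


def vif_covariance(X, Z, num_neighbors, nugget=1e-2):
    """Dense approximate covariance: low-rank part plus Vecchia residual part."""
    _, low_rank, R = residual_covariance(X, Z, nugget)
    B, D = vecchia_factors(R, correlation_neighbors(R, num_neighbors))
    B_inv = np.linalg.solve(B, np.eye(len(X)))
    return low_rank + B_inv @ np.diag(D) @ B_inv.T
```

When every point conditions on all previously ordered points, the Vecchia factorization of the residual is exact, so the sketch then reproduces the full covariance (including the nugget). With fewer neighbors, the sparsity of B is what makes the approach scale; the dense matrices here are used only to keep the illustration short.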