Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data

The Gaussian process (GP) is a widely used probabilistic machine learning method for stochastic function approximation, stochastic modeling, and the analysis of real-world measurements of nonlinear processes. Unlike many other machine learning methods, GPs include an implicit characterization of uncertainty, which makes them extremely useful across many areas of science, technology, and engineering. Traditional implementations of GPs have two limitations: they rely on stationary kernels (also termed covariance functions), which limit flexibility, and on exact methods for inference, which prevent application to data sets with more than about ten thousand points. Modern approaches that relax the stationarity assumption generally fail to accommodate large data sets, while all attempts to address scalability focus on approximating the Gaussian likelihood, which can involve subjectivity and lead to inaccuracies. In this work, we explicitly derive an alternative kernel that can discover and encode both sparsity and nonstationarity. We embed the kernel within a fully Bayesian GP model and leverage high-performance computing resources to enable the analysis of massive data sets. We demonstrate the favorable performance of our novel kernel relative to existing exact and approximate GP methods across a variety of synthetic data examples. Furthermore, we conduct space-time prediction based on more than one million measurements of daily maximum temperature and verify that our results outperform state-of-the-art methods in the Earth sciences. More broadly, access to exact GPs that use ultra-scalable, sparsity-discovering, nonstationary kernels allows GP methods to truly compete with a wide variety of machine learning methods.
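
To make the idea of a kernel that encodes both sparsity and nonstationarity concrete, the following is a minimal sketch and not the kernel derived in this work. It assumes a generic construction: a compactly supported Wendland function (which vanishes beyond a chosen radius, so most covariance entries are exactly zero) multiplied by a hypothetical input-dependent amplitude (which makes the kernel nonstationary). The function names, the amplitude function, the radius, and the noise variance are all illustrative assumptions.

```python
import numpy as np

def wendland(d):
    """Wendland C2 function: positive definite in up to 3 dimensions, exactly zero for d >= 1."""
    return np.where(d < 1.0, (1.0 - d) ** 4 * (4.0 * d + 1.0), 0.0)

def amplitude(x):
    """Hypothetical input-dependent signal standard deviation (the source of nonstationarity)."""
    return 1.0 + 0.5 * np.sin(2.0 * np.pi * x)

def kernel_matrix(x1, x2, radius=0.1):
    """k(x, x') = g(x) g(x') w(|x - x'| / radius); zero whenever |x - x'| >= radius."""
    d = np.abs(x1[:, None] - x2[None, :]) / radius
    return amplitude(x1)[:, None] * amplitude(x2)[None, :] * wendland(d)

# Synthetic one-dimensional data (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(6.0 * x) + 0.1 * rng.standard_normal(x.size)

# Covariance of the noisy observations; no approximation of the likelihood is made.
noise_var = 0.01
K = kernel_matrix(x, x) + noise_var * np.eye(x.size)
sparsity = np.mean(K == 0.0)  # fraction of entries that are exactly zero

# Exact GP posterior mean at new inputs (dense solve used here for brevity).
x_new = np.linspace(0.05, 0.95, 50)
K_star = kernel_matrix(x_new, x)
posterior_mean = K_star @ np.linalg.solve(K, y)

print(f"fraction of exact zeros in K: {sparsity:.2f}")
```

Because the zeros arise from the kernel itself, the GP remains exact. The sketch uses a dense solve for brevity; at scale, the covariance would be stored in a sparse format and handled with sparse linear algebra, which is where the computational savings come from.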
