Recent advances in data collection technologies have led to the emergence of massive spatial datasets, with measurements obtained at millions of spatial locations. Geostatistical models typically employ Gaussian processes (GPs) to capture spatial dependence, but standard GP fitting becomes computationally prohibitive at such scales. A promising solution is optimal subsampling, in which a subset of locations is selected to optimize a chosen criterion. In this study, we propose a randomized exchange algorithm for subsampling (REX-SUB), which efficiently selects small subsamples that minimize prediction errors in the fitted spatial GP models. To further improve computational efficiency, we embed a scalable Vecchia approximation to the GP's joint likelihood, which takes advantage of sparsity in the precision matrix to enable fast inference on the selected subsamples. Through a simulation study and an application to a remotely sensed precipitable water dataset, we show that REX-SUB yields lower mean squared prediction errors and interval scores than competing subsampling strategies.
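To make the idea of an exchange-based subsample search concrete, the following is a minimal sketch of a generic randomized exchange, not the REX-SUB algorithm itself. Everything in it is an assumption chosen for illustration: the exponential covariance with fixed parameters, the criterion (average simple-kriging variance at a set of target locations, used here as a stand-in for prediction error), and the hypothetical function names `exp_cov`, `mean_pred_var`, and `random_exchange`. It uses a dense solve rather than a Vecchia approximation.

```python
import numpy as np

def exp_cov(a, b, range_=0.3, sigma2=1.0):
    """Exponential covariance between two sets of 2-D locations (assumed kernel)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / range_)

def mean_pred_var(locs, subset, targets, nugget=1e-6):
    """Average simple-kriging variance at `targets` given data at locs[subset].

    This criterion is an illustrative assumption, not the paper's objective.
    """
    s = locs[subset]
    K = exp_cov(s, s) + nugget * np.eye(len(subset))
    k = exp_cov(s, targets)
    # Prior variance is sigma2 = 1.0; subtract the variance explained by the subset.
    explained = np.sum(k * np.linalg.solve(K, k), axis=0)
    return float(np.mean(1.0 - explained))

def random_exchange(locs, targets, k, n_iter=200, seed=0):
    """Swap one random in-subset point for one random outside point; keep improvements."""
    rng = np.random.default_rng(seed)
    n = len(locs)
    subset = rng.choice(n, size=k, replace=False)
    start = best = mean_pred_var(locs, subset, targets)
    for _ in range(n_iter):
        out_pos = rng.integers(k)                      # position to swap out
        candidates = np.setdiff1d(np.arange(n), subset)
        trial = subset.copy()
        trial[out_pos] = rng.choice(candidates)        # location to swap in
        score = mean_pred_var(locs, trial, targets)
        if score < best:                               # accept only strict improvements
            subset, best = trial, score
    return subset, start, best
```

Each criterion evaluation here costs a dense solve on the subsample; a sparse approximation such as Vecchia's would presumably replace that step in a scalable implementation.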