Gaussian and discrete non-Gaussian spatial datasets are prevalent across many fields, including public health, ecology, geosciences, and social sciences. Bayesian spatial generalized linear mixed models (SGLMMs) are a flexible class of models designed for these data, but SGLMMs do not scale well, even to moderately large datasets. State-of-the-art scalable SGLMMs (i.e., those based on basis representations or sparse covariance/precision matrices) require posterior sampling via Markov chain Monte Carlo (MCMC), which can be computationally prohibitive for large datasets. While variational Bayes (VB) methods have been extended to SGLMMs, their focus has primarily been on smaller spatial datasets. In this study, we propose two computationally efficient VB approaches for modeling moderate-sized and massive (millions of locations) Gaussian and discrete non-Gaussian spatial data. Our scalable VB method embeds semi-parametric approximations for the latent spatial random processes and leverages the parallel computing offered by modern high-performance computing systems. Our approaches deliver nearly identical inferential and predictive performance compared to 'gold standard' methods while achieving computational speedups of up to 1000x. We demonstrate our approaches through a comparative numerical study as well as applications to two real-world datasets. Our proposed VB methodology enables practitioners to model millions of non-Gaussian spatial observations on a standard laptop within a short timeframe.