Gaussian and discrete non-Gaussian spatial datasets are prevalent across many fields, such as public health, ecology, geosciences, and social sciences. Bayesian spatial generalized linear mixed models (SGLMMs) are a flexible class of models designed for these data, but they do not scale well, even to moderately large datasets. State-of-the-art scalable SGLMMs (e.g., those based on basis representations or sparse covariance/precision matrices) require posterior sampling via Markov chain Monte Carlo (MCMC), which can be computationally prohibitive for large datasets. While variational Bayes (VB) methods have been extended to SGLMMs, their focus has primarily been on smaller spatial datasets. In this study, we propose two computationally efficient VB approaches for modeling moderate-sized and massive (millions of locations) Gaussian and discrete non-Gaussian spatial data. Our scalable VB methods embed semi-parametric approximations for the latent spatial random processes and exploit the parallel computing offered by modern high-performance computing systems. Our approaches deliver inferential and predictive performance nearly identical to that of 'gold standard' methods while achieving computational speedups of up to 1000x. We demonstrate our approaches through a comparative numerical study as well as applications to two real-world datasets. Our proposed VB methodology enables practitioners to model millions of non-Gaussian spatial observations on a standard laptop within a short timeframe.