Binary geospatial data is commonly analyzed with generalized linear mixed models, specified with a linear fixed covariate effect and a Gaussian Process (GP)-distributed spatial random effect, both related to the response via a link function. The assumption of linear covariate effects is severely restrictive. Random Forests (RF) are increasingly being used for non-linear modeling of spatial data, but current extensions of RF for binary spatial data depart from the mixed model setup, relinquishing inference on the fixed effects and other advantages of using GP. We propose RF-GP, using Random Forests for estimating the non-linear covariate effect and Gaussian Processes for modeling the spatial random effects directly within the generalized mixed model framework. We observe and exploit the equivalence of the Gini impurity measure and the least squares loss to propose an extension of RF for binary data that accounts for the spatial dependence. We then propose a novel link inversion algorithm that leverages the properties of GP to estimate the covariate effects and offer spatial predictions. RF-GP outperforms existing RF methods for estimation and prediction in both simulated and real-world data. We establish consistency of RF-GP for a general class of $\beta$-mixing binary processes that includes common choices like spatial Mat\'ern GP and autoregressive processes.
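The Gini/least-squares equivalence mentioned above can be checked numerically in the simplest, non-spatial case. For 0/1 labels with node proportion p, the Gini impurity is 2p(1-p) and the mean squared error around the node mean is p(1-p), so the two split criteria differ only by a factor of 2. The sketch below is an illustration under that assumption, not the RF-GP algorithm itself: it ignores spatial dependence entirely, and the function names (`gini`, `mse`, `split_costs`) and the simulated data are hypothetical.

```python
import numpy as np

def gini(y):
    # Gini impurity of a node holding 0/1 labels: 2 * p * (1 - p)
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)

def mse(y):
    # Least squares loss of a node: mean squared deviation from the node mean
    return np.mean((y - np.mean(y)) ** 2)

def split_costs(y, x, impurity):
    # Size-weighted child impurity for every candidate threshold on x.
    # The largest value of x is dropped so the right child is never empty.
    thresholds = np.unique(x)[:-1]
    n = len(y)
    costs = []
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        costs.append((len(left) * impurity(left) + len(right) * impurity(right)) / n)
    return thresholds, np.array(costs)

# Hypothetical toy data: the success probability jumps at x = 0.6
rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = (rng.uniform(size=200) < np.where(x > 0.6, 0.8, 0.2)).astype(float)

thresholds, gini_costs = split_costs(y, x, gini)
_, mse_costs = split_costs(y, x, mse)

# The two cost curves are proportional, so they rank candidate splits identically
print(np.allclose(gini_costs, 2.0 * mse_costs))
```

Because the two cost curves are proportional, a tree grown with either criterion chooses the same splits on independent data. This is what makes it natural to replace the Gini criterion with a least squares loss that can then be adapted to dependent observations.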