In areal unit data where some observations are missing or suppressed, it is desirable to build models that can predict the unavailable values. Traditional statistical methods achieve this with Bayesian hierarchical models that capture unexplained residual spatial autocorrelation through conditional autoregressive (CAR) priors, which allows them to make predictions at geographically related locations. In contrast, typical machine learning approaches such as random forests ignore this residual autocorrelation and instead base predictions on complex non-linear feature-target relationships. In this paper, we propose CAR-Forest, a novel spatial prediction algorithm that combines the strengths of both approaches. By iteratively refitting a random forest together with a Bayesian CAR model within a single algorithm, CAR-Forest can incorporate flexible feature-target relationships while still accounting for residual spatial autocorrelation. Our results, based on a Scottish housing price data set, show that CAR-Forest outperforms Bayesian CAR models, random forests, and the leading hybrid approach, geographically weighted random forest, providing a state-of-the-art framework for small-area spatial prediction.
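The alternating idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so everything below is an assumption made for illustration. In particular, the full Bayesian CAR fit is replaced by a closed-form Gaussian posterior mean under a Leroux-type CAR prior with fixed hyperparameters, and the function name `car_forest_sketch` and the parameters `rho` and `ratio` are hypothetical. The sketch assumes NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def car_forest_sketch(X, y, W, observed, n_iter=10, rho=0.9, ratio=1.0, seed=0):
    """Alternate a random forest fit with CAR-style smoothing of its residuals.

    X        : (n, p) feature matrix for all n areas
    y        : (n,) target; entries where `observed` is False are never read
    W        : (n, n) symmetric binary neighbourhood matrix
    observed : (n,) boolean mask of areas with a recorded target
    rho      : assumed fixed spatial dependence parameter, 0 <= rho < 1
    ratio    : assumed fixed ratio of noise variance to spatial variance
    (A Bayesian fit would estimate rho and the variances instead of fixing them.)
    """
    n = len(y)
    phi = np.zeros(n)  # spatial random effects, one per area
    # Precision matrix (up to scale) of a Leroux-type CAR prior.
    Q = rho * (np.diag(W.sum(axis=1)) - W) + (1.0 - rho) * np.eye(n)
    obs = observed.astype(float)

    for _ in range(n_iter):
        # Step 1: fit the forest to the target with the spatial effect removed.
        forest = RandomForestRegressor(
            n_estimators=200, oob_score=True, random_state=seed
        )
        forest.fit(X[observed], y[observed] - phi[observed])
        mu = forest.predict(X)

        # Step 2: residuals of the forest at observed areas. Out-of-bag
        # predictions are used (a choice made for this sketch) so that
        # in-sample overfitting does not hide the spatial signal.
        resid = np.zeros(n)
        resid[observed] = y[observed] - forest.oob_prediction_

        # Step 3: Gaussian posterior mean of phi given the residuals,
        # which also fills in phi for areas without an observation.
        phi = np.linalg.solve(np.diag(obs) + ratio * Q, obs * resid)

    # Prediction for every area: forest component plus spatial component.
    return mu + phi
```

Calling `car_forest_sketch(X, y, W, observed)` returns one prediction per area. For areas with a missing or suppressed target, the spatial component is inferred purely from neighbouring residuals through `W`, which is the mechanism the abstract attributes to CAR priors.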