Evaluating models fit to data with internal spatial structure requires specific cross-validation (CV) approaches, because randomly selecting assessment data may produce assessment sets that are not truly independent of the data used to train the model. Many spatial CV methodologies have been proposed to address this by forcing models to extrapolate spatially when predicting the assessment set. However, to date there exists little guidance on which methods yield the most accurate estimates of model performance. We conducted simulations to compare model performance estimates produced by five common CV methods applied to models fit to spatially structured data. We found that spatial CV approaches generally improved upon resubstitution and V-fold CV estimates, particularly approaches which combined assessment sets of spatially conjunct observations with spatial exclusion buffers. To facilitate use of these techniques, we introduce the `spatialsample` package, which provides tooling for performing spatial CV as part of the broader tidymodels modeling framework.