Spatial epidemiology identifies the drivers of elevated population-level disease risk, using disease counts, exposures and known confounders measured at the areal unit level. Inference is typically based on Poisson regression models, which combine a linear or additive regression component with a set of spatially autocorrelated random effects that allow for unmeasured confounding. This approach requires the confounder interactions, and the functional forms of their relationships with disease risk, to be specified in advance rather than learned from the data. This paper therefore proposes the SPAR-Forest-ERF algorithm, the first fusion of random forests, which capture non-linear and interacting confounder-response effects, with Bayesian spatial autocorrelation models that can estimate interpretable exposure response functions (ERFs) with full uncertainty quantification. Methodologically, we extend existing methods developed for prediction in three ways: by propagating uncertainty between the machine learning (ML) and statistical models, by developing a new stopping criterion designed to ensure the stability of the primary inferential target, and by incorporating a range of different ERFs for maximum model flexibility. This methodology is motivated by a new study quantifying the impact of air pollution concentrations on self-rated health in Scotland, using data from the recently released 2022 national census.
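
For orientation, the conventional Poisson model described above can be sketched in a generic form. This is a minimal illustration of the standard disease-mapping setup, not the formulation used in this paper. All symbols are assumptions introduced here: $Y_k$ and $E_k$ are the observed and expected disease counts in areal unit $k$, $\theta_k$ is the disease risk, $\mathbf{x}_k$ is the vector of known confounders, $z_k$ is the exposure, $f$ is the exposure response function, and $\phi_k$ is a spatially autocorrelated random effect.

```latex
\begin{align}
  Y_k \mid E_k, \theta_k &\sim \text{Poisson}(E_k \theta_k), \qquad k = 1, \ldots, K, \\
  \log(\theta_k) &= \beta_0 + \mathbf{x}_k^{\top} \boldsymbol{\beta} + f(z_k) + \phi_k .
\end{align}
```

In this sketch the confounder term $\mathbf{x}_k^{\top} \boldsymbol{\beta}$ is linear, so any interactions or non-linear effects have to be added by hand. That is the restriction the abstract says the random forest component is meant to relax.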