Citizen science mobilises many observers and gathers huge datasets, but often without strict sampling protocols. The resulting heterogeneity in sampling effort creates observation biases that can lead to biased statistical inferences. We develop a spatiotemporal Bayesian hierarchical model for bias-corrected estimation of the arrival dates of the first migratory bird individuals at a breeding site, motivated by the possibility that higher sampling effort is correlated with earlier observed dates. We implement data fusion of two citizen-science datasets with markedly different protocols (BBS, eBird) and map posterior distributions of the latent process, which contains four spatial components with Gaussian process priors: species niche, sampling effort, and the location and scale parameters of the annual first arrival date. The data layer includes four response variables: counts of observed eBird locations (Poisson); presence-absence at observed eBird locations (Binomial); BBS occurrence counts (Poisson); and first arrival dates (Generalized Extreme-Value). We devise a Markov chain Monte Carlo scheme and check by simulation that the latent process components are identifiable. We apply our model to several migratory bird species in the northeastern US for 2001--2021. We show that sampling effort significantly modulates the observed first arrival date, and we exploit this relationship to debias predictions of the true first arrival dates.
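The link between sampling effort and earlier observed first dates can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not taken from the paper: detection dates are independent normal draws (the mean of day 120 and standard deviation of 10 days are arbitrary), effort is represented only by the number of detections per season, and the observed first date is the earliest detection. The function name `simulate_first_dates` and all parameter values are hypothetical.

```python
import numpy as np


def simulate_first_dates(n_detections, n_years=2000, mean_day=120.0,
                         sd_day=10.0, seed=0):
    """Simulate the observed first arrival date for each of n_years seasons.

    Illustrative assumption only: every detection date is an independent
    normal draw, and the observed first date is the earliest detection.
    """
    rng = np.random.default_rng(seed)
    # One row per season, one column per detection in that season.
    dates = rng.normal(mean_day, sd_day, size=(n_years, n_detections))
    # The observed first arrival date is the minimum over detections.
    return dates.min(axis=1)


# Low effort: few detections per season; high effort: many detections.
low_effort = simulate_first_dates(n_detections=5)
high_effort = simulate_first_dates(n_detections=200)

print(f"mean first date, low effort:  {low_effort.mean():.1f}")
print(f"mean first date, high effort: {high_effort.mean():.1f}")
```

Because the minimum of more draws tends to be smaller, the high-effort seasons report earlier first dates even though the underlying date distribution is identical. The observed first date is a sample minimum, which is consistent with the abstract's use of a Generalized Extreme-Value distribution for first arrival dates.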