Ambient air pollution poses significant health and environmental challenges. Exposure to high concentrations of PM$_{2.5}$ has been linked to increased respiratory and cardiovascular hospital admissions, more emergency department visits, and more deaths. Traditional air quality monitoring systems, such as EPA-certified stations, provide data with limited spatial and temporal coverage. The advent of low-cost sensors has dramatically improved the granularity of air quality data, enabling real-time, high-resolution monitoring. This study uses the extensive data from PurpleAir sensors to assess and compare the effectiveness of various statistical and machine learning models in producing accurate hourly PM$_{2.5}$ maps across California. We evaluate traditional geostatistical methods, including kriging and land use regression, against advanced machine learning approaches such as neural networks, random forests, and support vector machines, as well as an ensemble model. We improve the predictive accuracy of PM$_{2.5}$ concentrations by correcting the bias in PurpleAir data with an ensemble model that incorporates both spatiotemporal dependencies and machine learning models.