Ambient air pollution remains a critical issue in the United Kingdom, where measured pollutant concentrations form the foundation for interventions aimed at improving air quality. However, the UK's current network of air pollution monitoring stations is spatially sparse and heterogeneously placed, and its records contain frequent temporal gaps, often caused by issues such as power outages. We introduce a scalable, data-driven supervised machine learning framework that fills these temporal and spatial gaps by estimating the missing measurements. The approach yields a comprehensive dataset for England covering all of 2018 at an hourly, 1 km × 1 km resolution. Using real-world data from the sparsely distributed monitoring stations, we generate 355,827 synthetic monitoring stations across the study area, yielding data valued at approximately £70 billion. We validate the model's performance in forecasting, in estimating concentrations at locations without monitoring stations, and in capturing peak concentrations. The resulting dataset is of particular interest to the wide range of stakeholders whose downstream assessments rely on outdoor air pollution concentration data for NO2, O3, PM10, PM2.5, and SO2, and it enables them to conduct studies at a higher resolution than was previously possible.
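As a rough illustration of the gap-filling idea only, the sketch below trains a supervised regressor on the hours where a station reported a value and predicts the hours where it did not. It is not the framework described above: ordinary least squares stands in for the actual model, the station series is synthetic, and the function name `fill_gaps`, the diurnal features, and the simulated outage window are all assumptions made for this example.

```python
import numpy as np


def fill_gaps(features, target):
    """Fit least squares on observed rows and predict the missing (NaN) rows.

    Hypothetical helper for illustration; a stand-in for a richer model.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    observed = ~np.isnan(target)

    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(features)), features])

    # Train only on hours that have a measurement.
    coef, *_ = np.linalg.lstsq(X[observed], target[observed], rcond=None)

    # Keep real measurements; fill only the gaps with predictions.
    filled = target.copy()
    filled[~observed] = X[~observed] @ coef
    return filled


# Synthetic two-week hourly series for one made-up station.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
diurnal = np.column_stack([
    np.sin(2 * np.pi * hours / 24),
    np.cos(2 * np.pi * hours / 24),
])
truth = 30 + 10 * diurnal[:, 0] - 5 * diurnal[:, 1] + rng.normal(0, 1, hours.size)

# Simulate a 30-hour outage (assumed gap, e.g. a power failure).
measured = truth.copy()
measured[100:130] = np.nan

filled = fill_gaps(diurnal, measured)

# Error on the held-out gap, available here only because the data are synthetic.
gap_rmse = float(np.sqrt(np.mean((filled[100:130] - truth[100:130]) ** 2)))
print(f"RMSE over the simulated gap: {gap_rmse:.2f}")
```

Hiding known measurements and scoring the predictions against them, as done here for the simulated outage, is one simple way to check a gap-filling model; estimating concentrations at unmonitored locations would additionally need spatial features, which this sketch omits.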