Atmospheric visibility is a critical variable for transportation safety and air quality management. Accurate prediction nevertheless remains challenging because of the complex interactions between meteorological conditions and air pollutants and the rarity of low-visibility events. This study introduces a machine learning framework to nowcast visibility in six major South Korean cities. To handle class imbalance in the 2018-2020 training data, we applied the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTENC) and the Conditional Tabular Generative Adversarial Network (CTGAN). An ensemble combining machine learning and deep learning models was then applied and evaluated on a 2021 test dataset. Predictive performance declined markedly on the test set relative to the cross-validation phase. We attribute this degradation to a distributional shift between the training and testing periods, which we confirmed quantitatively by measuring the Wasserstein distance between the two periods for the most influential feature identified by SHAP analysis. Overall, this study presents a methodology that aims to address the dual challenges of data imbalance and temporal distributional shift together, and it emphasizes the need to account for evolving external environmental factors when deploying nowcasting models on time-series data.
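As an illustration of the shift check described above, the following is a minimal sketch of a one-dimensional Wasserstein (earth mover's) distance between the training-period and test-period values of a single feature. It is not the study's implementation: the function name `wasserstein_1d`, the variables `train_feature` and `test_feature`, and the synthetic Gaussian samples are all assumptions made for the example, and the feature itself is left unnamed because the abstract does not identify it.

```python
import random
from bisect import bisect_right


def wasserstein_1d(u, v):
    """First-order Wasserstein distance between two 1-D empirical samples.

    Computed as the integral of |F_u(x) - F_v(x)| over x, where F_u and F_v
    are the empirical CDFs of the two samples.
    """
    u_sorted = sorted(u)
    v_sorted = sorted(v)
    grid = sorted(u_sorted + v_sorted)  # every point where a CDF can jump

    total = 0.0
    for left, right in zip(grid[:-1], grid[1:]):
        # Both CDFs are constant on [left, right), so evaluate them at `left`.
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Hypothetical data: stand-ins for one feature's values in the training
# period (2018-2020) and the test period (2021). Real values would come
# from the observational dataset, not from a random generator.
rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(2000)]
test_feature = [rng.gauss(0.5, 1.2) for _ in range(700)]

shift = wasserstein_1d(train_feature, test_feature)
print(f"Wasserstein distance (train vs. test): {shift:.3f}")
```

A distance near zero would suggest the feature is similarly distributed in both periods, while a larger value points to a shift. The magnitude is expressed in the units of the feature, so comparing it across features generally requires standardizing them first.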