Machine learning models for forecasting solar flares have been trained and evaluated using a variety of data sources, including Space Weather Prediction Center (SWPC) operational and science-quality data. Typically, data from these sources is minimally processed before being used to train and validate a forecasting model. However, predictive performance can be affected if defects and inconsistencies between these data sources are ignored. For a set of commonly used data sources, along with the software that queries and outputs processed data, we identify their defects and inconsistencies, quantify their extent, and show how they can affect predictions from data-driven machine-learning forecasting models. We also outline procedures for fixing these issues or at least mitigating their impacts. Finally, based on thorough comparisons of the effects of data sources on the trained forecasting model's predictive skill scores, we offer recommendations for using different data products in operational forecasting.