In Earth Systems Science, many complex data pipelines combine different data sources and apply data filtering and analysis steps. Such data analysis processes have typically grown over time and are implemented as many sequentially executed scripts. Scientific workflow management systems (SWMS) let scientists keep using their existing scripts while providing support for parallelization, reusability, monitoring, and failure handling. However, many scientists still rely on sequentially called scripts and do not benefit from the out-of-the-box advantages an SWMS can provide. In this work, we transform the data analysis processes of a machine learning-based approach, which uses neural networks to calibrate the platform magnetometers of non-dedicated satellites, into a workflow called Macaw (MAgnetometer CAlibration Workflow). We provide details on the workflow and the steps needed to port these scripts to a scientific workflow. Our experimental evaluation compares the original sequential script executions on the original HPC cluster with our workflow implementation on a commodity cluster. Our results show that the ported implementation decreased the allocated CPU hours by 50.2% and the memory hours by 59.5%, leading to significantly less resource wastage. Further, by parallelizing single tasks, we reduced the runtime by 17.5%.