A computational workflow, or simply a workflow, consists of tasks that must be executed in a specific order to attain a specific goal. In fields such as biology, chemistry, physics, and data science, these workflows are often complex and run in large-scale, distributed, and heterogeneous computing environments that are prone to failures and performance degradation. Anomaly detection for workflows, which aims to identify unexpected behavior or errors in workflow execution, is therefore important for improving the reliability of workflow executions, and machine learning-based techniques can assist with it. However, the application of such techniques is limited, in large part, by the lack of open datasets and benchmarks. To address this gap, we make the following contributions in this paper: (1) we systematically inject anomalies and collect raw execution logs from workflows executing on distributed infrastructures; (2) we summarize the statistics of the new datasets and provide analyses of them; (3) we convert workflows into tabular, graph, and text data, and benchmark each representation with supervised and unsupervised anomaly detection techniques. The presented dataset and benchmarks allow examining the effectiveness and efficiency of scientific computational workflows and identifying potential research opportunities for improvement and generalization. The dataset and benchmark code are publicly available at \url{https://poseidon-workflows.github.io/FlowBench/} under the MIT License.
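Contribution (3) can be illustrated with a minimal sketch of how one workflow execution might be turned into tabular, graph, and text views. This is not the FlowBench code or data schema: the `Job` fields, the job names, the feature values, and the helper functions below are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One workflow task; every field here is a hypothetical example feature."""
    name: str
    runtime_s: float
    wait_s: float
    label: int  # assumed convention: 1 = anomalous, 0 = normal


def to_tabular(jobs):
    """Tabular view: one feature row per job, plus a label per job."""
    features = [[j.runtime_s, j.wait_s] for j in jobs]
    labels = [j.label for j in jobs]
    return features, labels


def to_graph(jobs, deps):
    """Graph view: per-node features and a list of (source, target) index pairs."""
    index = {j.name: i for i, j in enumerate(jobs)}
    node_features = [[j.runtime_s, j.wait_s] for j in jobs]
    edge_index = [(index[a], index[b]) for a, b in deps]
    return node_features, edge_index


def to_text(jobs):
    """Text view: one sentence per job, suitable as input to a language model."""
    return [f"{j.name} runtime is {j.runtime_s} wait is {j.wait_s}" for j in jobs]


# Hypothetical three-job pipeline; "align" plays the role of an injected anomaly.
jobs = [
    Job("split", 12.0, 1.5, 0),
    Job("align", 340.0, 2.0, 1),
    Job("merge", 20.0, 0.5, 0),
]
deps = [("split", "align"), ("align", "merge")]

features, labels = to_tabular(jobs)
node_features, edge_index = to_graph(jobs, deps)
sentences = to_text(jobs)
```

The same per-job measurements feed all three views. Only the graph view keeps the execution order encoded in the task dependencies, which is one reason a benchmark might compare the representations side by side.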