A computational workflow, or simply a workflow, consists of tasks that must be executed in a specific order to attain a specific goal. In fields such as biology, chemistry, physics, and data science, these workflows are often complex and are executed in large-scale, distributed, and heterogeneous computing environments that are prone to failures and performance degradation. Anomaly detection for workflows, which aims to identify unexpected behavior or errors in workflow execution, is therefore an important paradigm. This task is crucial for improving the reliability of workflow executions and calls for machine learning-based techniques. However, the application of such techniques remains limited, in large part because of the lack of open datasets and benchmarks. To address this gap, we make the following contributions in this paper: (1) we systematically inject anomalies and collect raw execution logs from workflows executing on distributed infrastructures; (2) we summarize the statistics of the new datasets, as well as a set of open datasets, and provide analyses of them; (3) we benchmark unsupervised anomaly detection techniques by converting workflows into both tabular and graph-structured data. Our findings allow us to examine the effectiveness and efficiency of the benchmarked methods and to identify research opportunities for improvement and generalization. The dataset and benchmark code are available online under the MIT License for public use.
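To make contribution (3) concrete, the following minimal sketch shows one way a workflow could be viewed both as tabular data (one feature row per task) and as graph-structured data (node features plus an edge list), followed by a simple unsupervised score. Everything in it is an illustrative assumption: the task names, the two features, the edges, the helper functions, and the 3.5 threshold are hypothetical and are not taken from the dataset, and the median-absolute-deviation score is a simple stand-in rather than one of the benchmarked methods.

```python
from statistics import median

# Hypothetical toy workflow: task name -> (runtime_s, cpu_time_s).
# These values are invented for illustration, not drawn from the dataset.
TASKS = {
    "stage_in": (12.0, 3.0),
    "align_1": (95.0, 90.0),
    "align_2": (97.0, 91.0),
    "align_3": (310.0, 92.0),  # long wall-clock time, ordinary CPU time
    "merge": (40.0, 35.0),
}
# Task dependencies as (parent, child) pairs; together they form a DAG.
EDGES = [
    ("stage_in", "align_1"), ("stage_in", "align_2"), ("stage_in", "align_3"),
    ("align_1", "merge"), ("align_2", "merge"), ("align_3", "merge"),
]


def to_tabular(tasks):
    """Tabular view: one feature row per task; dependencies are dropped."""
    names = sorted(tasks)
    rows = [list(tasks[name]) for name in names]
    return names, rows


def to_graph(tasks, edges):
    """Graph view: node feature rows plus edges as pairs of node indices."""
    names = sorted(tasks)
    index = {name: i for i, name in enumerate(names)}
    features = [list(tasks[name]) for name in names]
    edge_index = [(index[u], index[v]) for u, v in edges]
    return features, edge_index


def mad_scores(values):
    """Robust deviation score: |x - median| / median absolute deviation."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # avoid dividing by zero
    return [abs(v - med) / mad for v in values]


names, rows = to_tabular(TASKS)
features, edge_index = to_graph(TASKS, EDGES)

# Score each task by its non-CPU overhead (runtime minus CPU time).
overhead = [runtime - cpu for runtime, cpu in rows]
scores = mad_scores(overhead)
flagged = [name for name, score in zip(names, scores) if score > 3.5]
print(flagged)
```

The tabular view scores each task in isolation, whereas the graph view keeps the dependency edges, so a graph-based detector could additionally compare a task with its parents and children.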