As the amount of data available to scientists grows in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users must specify the resources to reserve for each task so that sufficient resources are allocated on the target cluster infrastructure. Crucially, underestimating a task's memory requirements can cause the task to fail. Users therefore often resort to overprovisioning, which results in significant resource wastage and decreased throughput. In this paper, we propose a novel online method that uses monitoring time series data to predict task memory usage in order to reduce the memory wastage of scientific workflow tasks. Our method predicts a task's runtime, divides it into k equally sized segments, and learns the peak memory value for each segment as a function of the total input file size. We evaluate a prototype implementation of our method using workflows from the publicly available nf-core repository, showing an average memory wastage reduction of 29.48% compared to the best state-of-the-art approach.
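The segment-wise idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the prediction model, so ordinary least-squares lines are assumed here, and the function names (`fit_line`, `segment_peaks`, `train`, `predict`) as well as the layout of the `history` records are hypothetical. The sketch also refits from a stored history in one batch, whereas the proposed method is online.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b (assumed model, not from the paper)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        # All inputs identical: fall back to a constant prediction.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def segment_peaks(series, k):
    """Split a memory time series into k near-equal segments; return each peak."""
    n = len(series)
    if k < 1 or n < k:
        raise ValueError("need at least one sample per segment")
    bounds = [i * n // k for i in range(k + 1)]
    return [max(series[bounds[i]:bounds[i + 1]]) for i in range(k)]


def train(history, k):
    """history: list of (input_size, runtime, memory_series) from past runs."""
    sizes = [h[0] for h in history]
    runtime_model = fit_line(sizes, [h[1] for h in history])
    peaks = [segment_peaks(h[2], k) for h in history]
    # One model per segment: peak memory as a function of input size.
    segment_models = [fit_line(sizes, [p[i] for p in peaks]) for i in range(k)]
    return runtime_model, segment_models


def predict(model, input_size):
    """Return the predicted runtime and the k per-segment peak memory values."""
    (ra, rb), segment_models = model
    runtime = ra * input_size + rb
    return runtime, [a * input_size + b for a, b in segment_models]


# Illustrative, made-up monitoring data: (input size, runtime, memory samples).
history = [
    (1.0, 10.0, [1, 2, 4, 3]),
    (2.0, 20.0, [2, 4, 8, 6]),
    (3.0, 30.0, [3, 6, 12, 9]),
]
model = train(history, k=2)
runtime, peaks = predict(model, 4.0)
```

Under these assumptions, a scheduler could reserve `peaks[i]` during the i-th of the k equal slices of the predicted `runtime`, instead of reserving a single peak for the task's whole duration.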