As the volume of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing grows, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users must specify the resources to reserve for each task so that sufficient resources are allocated on the target cluster infrastructure. Crucially, underestimating a task's memory requirements can cause the task to fail. Users therefore often resort to overprovisioning, which wastes significant resources and decreases throughput. In this paper, we propose a novel online method that uses monitoring time series data to predict task memory usage and thereby reduce the memory wastage of scientific workflow tasks. Our method predicts a task's runtime, divides it into k equally sized segments, and learns the peak memory value for each segment as a function of the total input file size. We evaluate a prototype implementation of our method using workflows from the publicly available nf-core repository, showing an average memory wastage reduction of 29.48% compared to the best state-of-the-art approach.
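The segment-wise idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which learning model is used, so a per-segment least-squares line is assumed here, the runtime prediction step is omitted, and all function names (`segment_peaks`, `fit_line`, `train`, `predict`) are hypothetical.

```python
from statistics import mean


def segment_peaks(series, k):
    """Split a non-empty memory time series into k near-equal segments
    and return the peak value of each segment."""
    n = len(series)
    peaks = []
    for i in range(k):
        start = i * n // k
        # Guarantee at least one sample per segment when n < k.
        end = max((i + 1) * n // k, start + 1)
        peaks.append(max(series[start:end]))
    return peaks


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.
    Assumption: a linear model; the abstract does not name the model."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        # All input sizes identical: fall back to the largest observed peak.
        return 0.0, max(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx


def train(history, k):
    """history: list of (total_input_size, memory_series) from past runs.
    Returns one (slope, intercept) pair per segment."""
    sizes = [size for size, _ in history]
    peaks = [segment_peaks(series, k) for _, series in history]
    return [fit_line(sizes, [p[i] for p in peaks]) for i in range(k)]


def predict(model, input_size):
    """Predicted peak memory for each of the k segments of a new task."""
    return [slope * input_size + intercept for slope, intercept in model]


# Two past runs of the same task with input sizes 1.0 and 2.0 (arbitrary units).
history = [(1.0, [1, 2, 3, 4]), (2.0, [2, 4, 6, 8])]
model = train(history, k=2)
allocation = predict(model, 3.0)
```

In an online setting, the model would be refit as monitoring data from finished tasks arrives. A real system would also need a safety margin and a retry strategy for underpredictions, neither of which is shown here.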