Scientific workflows are pipelines of interdependent tasks that are increasingly executed on shared Kubernetes clusters via workflow engines such as Nextflow. Their energy consumption matters for both cost and sustainability. Because workflow tasks can be highly heterogeneous, they need to be examined and optimized individually. However, estimating task-level energy on clusters is difficult: Intel RAPL counters report only node-level energy, access to counters and host process information is typically restricted, and concurrent workloads introduce resource contention and measurement noise. We present Nf-PEAK, a containerized method that attributes CPU-package and DRAM energy to individual processes and Nextflow tasks. Nf-PEAK (i) identifies workflow pods, (ii) maps pods to host processes via cgroup metadata, (iii) samples RAPL and per-process performance counters, and (iv) applies a non-linear energy-credit model before aggregating results at the task level. We evaluate three nf-core workflows on a Kubernetes cluster, both in isolation and under controlled co-located CPU load. Nf-PEAK reaches an average mean absolute percentage error (MAPE) of 6.6% in isolated runs and 10.9% when an unrelated workload saturates 8 of 32 hardware threads per node, and it remains stable across 2, 3, 4, and 8 nodes. Compared to the state-of-the-art Kubernetes tool Kepler, Nf-PEAK yields lower error on average, particularly under co-located load.
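The sampling, attribution, and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not Nf-PEAK's implementation: the abstract does not specify the non-linear energy-credit model, so the sketch substitutes a plain proportional split as a stand-in. All function names (`rapl_delta_uj`, `attribute_energy`, `mape`) are hypothetical, and the per-process "activity" weights are assumed to come from some performance counter such as CPU time. The wraparound handling assumes a RAPL energy counter in microjoules that resets after reaching a known maximum range, as exposed by the Linux powercap interface.

```python
from typing import Dict, Sequence


def rapl_delta_uj(prev_uj: int, curr_uj: int, max_range_uj: int) -> int:
    """Energy (microjoules) between two readings of a wrapping counter.

    Assumes at most one wraparound happened between the two samples.
    """
    if curr_uj >= prev_uj:
        return curr_uj - prev_uj
    # The counter wrapped: add the remainder up to the maximum range.
    return (max_range_uj - prev_uj) + curr_uj


def attribute_energy(node_energy_j: float,
                     activity: Dict[int, float]) -> Dict[int, float]:
    """Split node-level energy across processes by activity share.

    Stand-in only: a linear, proportional split, NOT the non-linear
    energy-credit model described in the abstract. `activity` maps a
    process ID to an assumed activity weight (e.g. CPU time).
    """
    total = sum(activity.values())
    if total <= 0:
        return {pid: 0.0 for pid in activity}
    return {pid: node_energy_j * a / total for pid, a in activity.items()}


def mape(estimated: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Reference values equal to zero are skipped to avoid division by zero.
    """
    pairs = [(e, r) for e, r in zip(estimated, reference) if r != 0]
    if not pairs:
        raise ValueError("no non-zero reference values")
    return 100.0 * sum(abs(e - r) / abs(r) for e, r in pairs) / len(pairs)
```

For example, `attribute_energy(10.0, {1: 3.0, 2: 1.0})` assigns three quarters of the 10 J interval to process 1 and one quarter to process 2; summing such per-process shares over all processes of a pod, and over all sampling intervals of a task, would yield a task-level estimate that `mape` can compare against a reference measurement.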