With the growing demand for GPUs in high-performance computing (HPC), efficient use of GPU resources and power has become critical. In this paper, we analyze the GPU utilization, GPU memory utilization, and power consumption of the Vienna ab initio Simulation Package (VASP), using historical logs from the Slurm workload manager and GPU performance metrics collected by NVIDIA's Data Center GPU Manager (DCGM). VASP is a widely used materials science application on Perlmutter, an HPE Cray EX system at NERSC based on NVIDIA A100 GPUs. Using insights from this resource utilization analysis, we propose a resource prediction framework that predicts the average GPU power, maximum GPU utilization, and maximum GPU memory utilization of applications on heterogeneous HPC systems, to enable more efficient scheduling decisions and power-aware system operation. Our prediction framework consists of two stages: (1) training on the Slurm accounting logs alone and (2) augmenting the training data with historical GPU profiling metrics collected with DCGM. The maximum GPU utilization predictions using only the Slurm submission features achieve up to 97% accuracy. Furthermore, features engineered from GPU compute and memory activity metrics correlate well with average power consumption, and our runtime power usage prediction experiments achieve up to 92% accuracy. These findings demonstrate the effectiveness of DCGM metrics in capturing application characteristics and highlight their potential for developing predictive models to support dynamic power management in HPC systems.
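The two-stage setup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names (`nodes`, `gpus`, `time_limit_min`, `sm_activity`, `dram_activity`), the toy job records, and the k-nearest-neighbor regressor, which is only a stand-in because the abstract does not name the model used. The point is solely the difference between the two feature sets: stage 1 sees only submission-time Slurm features, while stage 2 appends historical DCGM-style activity metrics.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Optional


@dataclass
class Job:
    # Stage-1 features: known at submission time (hypothetical field names).
    nodes: int
    gpus: int
    time_limit_min: float
    # Stage-2 features: historical DCGM-style averages (hypothetical names).
    sm_activity: float
    dram_activity: float
    # Target: average GPU power in watts; None for a job still to be predicted.
    avg_power_w: Optional[float] = None


def stage1_features(job: Job) -> List[float]:
    """Slurm submission features only."""
    return [float(job.nodes), float(job.gpus), job.time_limit_min]


def stage2_features(job: Job) -> List[float]:
    """Slurm features augmented with GPU activity metrics."""
    return stage1_features(job) + [job.sm_activity, job.dram_activity]


def knn_predict(train: List[Job], query: Job,
                featurize: Callable[[Job], List[float]], k: int = 2) -> float:
    """Average the target of the k nearest training jobs.

    Features are divided by their range in the training set so that no
    single feature dominates the Euclidean distance.
    """
    rows = [featurize(job) for job in train]
    q = featurize(query)
    span = [(max(col) - min(col)) or 1.0 for col in zip(*rows)]

    def dist(row: List[float]) -> float:
        return sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(row, q, span)))

    ranked = sorted(zip(rows, train), key=lambda pair: dist(pair[0]))
    nearest = [job for _, job in ranked[:k]]
    return sum(job.avg_power_w for job in nearest) / len(nearest)


# Toy training data (made up, not taken from the paper).
train = [
    Job(1, 4, 60.0, 0.90, 0.40, 300.0),
    Job(1, 4, 60.0, 0.20, 0.10, 120.0),
    Job(1, 4, 60.0, 0.85, 0.35, 290.0),
    Job(1, 4, 60.0, 0.25, 0.15, 130.0),
    Job(2, 8, 120.0, 0.60, 0.30, 220.0),
]
query = Job(1, 4, 60.0, 0.88, 0.38)

stage1_estimate = knn_predict(train, query, stage1_features)
stage2_estimate = knn_predict(train, query, stage2_features)
print(f"stage 1: {stage1_estimate:.1f} W, stage 2: {stage2_estimate:.1f} W")
```

In this toy data, four training jobs share identical submission features but draw very different power, so the stage-1 estimate cannot separate compute-heavy from compute-light runs. Once the activity metrics are added, the nearest neighbors are the jobs with a similar GPU activity profile.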