Modern GPU-rich HPC systems are increasingly energy-constrained, so understanding an application's energy consumption is essential. Unfortunately, current GPU energy attribution techniques are inaccurate, inflexible, or outdated. We therefore propose Wattchmen, a flexible methodology for measuring, attributing, and predicting GPU energy consumption. Using a diverse set of microbenchmarks, we systematically quantify the energy consumption of GPU instructions and construct a per-instruction energy model, which enables finer-grained energy prediction and per-application energy breakdowns. Across 16 popular GPGPU, graph analytics, HPC, and ML workloads on V100 GPUs, Wattchmen reduces the mean absolute percent error (MAPE) to 14%, compared with 32% for AccelWattch and 25% for Guser, two state-of-the-art systems. Furthermore, we show that Wattchmen achieves a similar MAPE on water-cooled V100s (15%) and extends to later architectures, including air-cooled A100 (11%) and H100 (12%) GPUs. Finally, to further demonstrate Wattchmen's value, we apply it to applications such as Backprop and QMCPACK, where its insights enable energy reductions of up to 35%.
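To make the two quantities in the abstract concrete, the following is a minimal sketch of how a per-instruction energy model could be turned into a prediction, and how MAPE could be computed against measurements. It is not Wattchmen's implementation: it assumes the simplest possible form, a linear sum of instruction counts times per-instruction energies, and it ignores static or idle power. The function names, opcode names, and energy values are all hypothetical and chosen only for illustration.

```python
from typing import Mapping, Sequence


def predict_energy_joules(
    instruction_counts: Mapping[str, int],
    energy_per_instruction_nj: Mapping[str, float],
) -> float:
    """Sum count * per-instruction energy (in nJ) and return joules.

    Assumption: energy is a plain linear sum over executed instructions.
    """
    total_nj = 0.0
    for opcode, count in instruction_counts.items():
        # Fail loudly on opcodes the (hypothetical) table does not cover.
        if opcode not in energy_per_instruction_nj:
            raise KeyError(f"no energy entry for opcode {opcode!r}")
        total_nj += count * energy_per_instruction_nj[opcode]
    return total_nj * 1e-9


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percent error between measurements and predictions."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Hypothetical values for illustration only; not taken from the paper.
table_nj = {"FADD": 0.5, "LDG": 4.0}
counts = {"FADD": 2_000_000, "LDG": 500_000}
energy = predict_energy_joules(counts, table_nj)
error = mape(measured=[100.0, 200.0], predicted=[110.0, 180.0])
```

Keeping the energy table keyed by opcode is what would allow a per-application breakdown: the same loop can report each opcode's share of the total instead of only the sum.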