Energy is now a critical computing resource for ML. While measuring energy consumption and observing trends is a valuable first step, optimization requires accurately understanding and diagnosing why consumption differs across workloads. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape, spanning 46 models, 7 tasks, and 1,858 configurations on NVIDIA H100 and B200 GPUs. Our empirical findings cover order-of-magnitude variations: LLM task type can lead to 25$\times$ differences in energy, video generation can consume more than 100$\times$ the energy of image generation, and differences in GPU utilization can change energy consumption by 3--5$\times$. Based on these observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. Its essence is that time and energy are determined by latent metrics such as memory usage and utilization, which are in turn shaped by factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.
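To make the framework's central accounting concrete, consider the following relationship (our notation, given as a hedged sketch rather than the study's exact formulation). Energy is the time integral of power, so for a workload that runs for time $T$ at average power $\bar{P}$,
\[
E = \int_0^T P(t)\,\mathrm{d}t \;\approx\; \bar{P}\,T,
\]
and for a run that produces $N$ outputs (tokens, images, or frames), throughput per watt reduces to the reciprocal of energy per output:
\[
\frac{N/T}{\bar{P}} \;=\; \frac{N}{\bar{P}\,T} \;\approx\; \frac{N}{E}.
\]
Under this accounting, maximizing throughput per watt and minimizing energy per output are the same objective, which is why reasoning about energy extends directly to power-constrained serving.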