The expansion of artificial intelligence (AI) applications has driven substantial investment in computational infrastructure, especially by cloud computing providers. Quantifying the energy footprint of this infrastructure requires models parameterized by the power demand of AI hardware during training. We empirically measured the instantaneous power draw of an 8-GPU NVIDIA H100 HGX node during the training of an open-source image classifier (ResNet) and a large language model (Llama2-13b). The maximum observed power draw was approximately 8.4 kW, 18% lower than the manufacturer-rated 10.2 kW, even with GPUs near full utilization. Holding model architecture constant, increasing the batch size from 512 to 4096 images for ResNet reduced total training energy consumption by a factor of 4. These findings can inform capacity planning by data center operators and energy use estimates by researchers. Future work will investigate the impact of cooling technology and carbon-aware scheduling on AI workload energy consumption.
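As a rough illustration of the measurement setup, the following minimal sketch samples per-GPU power via NVML (through the pynvml Python bindings) and integrates the samples into total energy. It is an assumed approach, not the study's actual instrumentation; NVML reports GPU-level draw only, so node-level figures such as the 8.4 kW reported here would additionally require PDU- or BMC-level telemetry covering CPU, memory, and fan power. The 1 Hz sampling rate is likewise an illustrative choice.

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

def gpu_power_watts():
    # nvmlDeviceGetPowerUsage reports milliwatts; sum across all GPUs.
    return sum(pynvml.nvmlDeviceGetPowerUsage(h) for h in handles) / 1000.0

samples = []          # (timestamp, watts) pairs
interval_s = 1.0      # hypothetical 1 Hz sampling cadence
try:
    while True:       # sample for the duration of the training job
        samples.append((time.time(), gpu_power_watts()))
        time.sleep(interval_s)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()

# Trapezoidal integration of power over time yields energy in joules;
# dividing by 3.6e6 converts joules to kWh.
energy_j = sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
print(f"peak GPU draw: {max(p for _, p in samples):.0f} W, "
      f"GPU energy: {energy_j / 3.6e6:.2f} kWh")

Integrating instantaneous power in this way, rather than multiplying a rated power by wall-clock time, is what makes the gap between observed peak draw and the manufacturer rating visible in the energy totals.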