The expansion of artificial intelligence (AI) applications has driven substantial investment in computational infrastructure, especially by cloud computing providers. Quantifying the energy footprint of this infrastructure requires models parameterized by the power demand of AI hardware during training. We empirically measured the instantaneous power draw of an 8-GPU NVIDIA H100 HGX node during the training of an open-source image classifier (ResNet) and a large language model (Llama2-13b). The maximum observed power draw was approximately 8.4 kW, 18% lower than the manufacturer-rated 10.2 kW, even with GPUs near full utilization. Holding model architecture constant, increasing the batch size from 512 to 4096 images for ResNet reduced total training energy consumption by a factor of 4. These findings can inform capacity planning by data center operators and energy use estimates by researchers. Future work will investigate the impact of cooling technology and carbon-aware scheduling on AI workload energy consumption.
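As a rough illustration of the measurement setup, the following minimal sketch samples per-GPU power via NVML (through the pynvml Python bindings) and integrates the samples into total energy. It is an assumed approach, not the study's actual instrumentation; NVML reports GPU-level draw only, so node-level figures such as the 8.4 kW reported here would additionally require PDU- or BMC-level telemetry covering CPU, memory, and fan power. The 1 Hz sampling rate is likewise an illustrative choice.

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

def gpu_power_watts():
    # nvmlDeviceGetPowerUsage reports milliwatts; sum across all GPUs.
    return sum(pynvml.nvmlDeviceGetPowerUsage(h) for h in handles) / 1000.0

samples = []          # (timestamp, watts) pairs
interval_s = 1.0      # hypothetical 1 Hz sampling cadence
try:
    while True:       # sample for the duration of the training job
        samples.append((time.time(), gpu_power_watts()))
        time.sleep(interval_s)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()

# Trapezoidal integration of power over time yields energy in joules;
# dividing by 3.6e6 converts joules to kWh.
energy_j = sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
print(f"peak GPU draw: {max(p for _, p in samples):.0f} W, "
      f"GPU energy: {energy_j / 3.6e6:.2f} kWh")

Integrating instantaneous power in this way, rather than multiplying a rated power by wall-clock time, is what makes the gap between observed peak draw and the manufacturer rating visible in the energy totals.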