The rapid growth of generative artificial intelligence (AI) has introduced unprecedented computational demands, driving significant increases in the energy footprint of data centers. However, existing power consumption data is largely proprietary and reported at varying resolutions, which makes it difficult to estimate whole-facility energy use and to plan infrastructure. In this work, we present a methodology that bridges this gap by linking high-resolution workload power measurements to whole-facility energy demand. Using NLR's high-performance computing data center equipped with NVIDIA H100 GPUs, we measure the power consumption of AI training, fine-tuning, and inference jobs at 0.1-second resolution. Workloads are characterized using MLCommons benchmarks for model training and fine-tuning and vLLM benchmarks for inference, enabling reproducible and standardized workload profiling. The dataset of power consumption profiles is made publicly available. These power profiles are then scaled to the whole-facility level using a bottom-up, event-driven data center energy model. The resulting whole-facility energy profiles capture realistic temporal fluctuations driven by AI workloads and user behavior, and can be used to inform infrastructure planning for grid connection, on-site energy generation, and distributed microgrids.
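As a rough illustration of the bottom-up, event-driven scaling step described above, the following minimal Python sketch superimposes per-node workload power profiles onto an idle baseline according to a list of job start events. It is not the model used in this work: the `JobEvent` structure, the per-node idle power, and the constant PUE multiplier used to approximate non-IT load are all assumptions introduced here for illustration, and the sketch does not check that concurrent jobs fit within the available nodes.

```python
from dataclasses import dataclass

DT = 0.1  # seconds per sample, matching the 0.1-second measurement resolution


@dataclass
class JobEvent:
    """Hypothetical scheduler event: one job starting on some number of nodes."""
    start_s: float   # job start time in seconds
    profile: list    # measured per-node power in watts, one sample every DT seconds
    nodes: int       # number of nodes the job occupies


def facility_power(events, total_nodes, idle_w, pue, horizon_s):
    """Return whole-facility power in watts for each DT step over horizon_s seconds."""
    n_steps = round(horizon_s / DT)
    # Baseline: every node draws its idle power.
    it_power = [total_nodes * idle_w] * n_steps
    for ev in events:
        first = round(ev.start_s / DT)
        for k, watts in enumerate(ev.profile):
            step = first + k
            if step >= n_steps:
                break  # ignore samples beyond the simulated horizon
            # On occupied nodes, swap the idle draw for the measured workload draw.
            it_power[step] += ev.nodes * (watts - idle_w)
    # Assumption: a constant PUE scales IT power to whole-facility power.
    return [p * pue for p in it_power]


def energy_kwh(power_w):
    """Integrate a power series in watts sampled every DT seconds into kWh."""
    return sum(power_w) * DT / 3.6e6
```

A call such as `facility_power(events, total_nodes=4, idle_w=100.0, pue=1.5, horizon_s=0.3)` returns one facility power value per 0.1-second step, and `energy_kwh` integrates that series. A more faithful model would replace the constant PUE with time-varying cooling and power-delivery terms and would draw job arrivals from observed user behavior.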