AI datacenters are being deployed at large scale to support the training and deployment of power-intensive large language models (LLMs). The extensive computation and cooling these datacenters require raise growing concerns about their energy use and carbon emissions. Although state-of-the-art work has examined the energy efficiency of LLM inference, most prior research focuses on optimizing compute-side scheduling without considering thermal objectives or constraints. Since GPU-intensive inference generates substantial heat that can degrade datacenter performance, ignoring thermal effects can increase total energy consumption and reduce the efficiency of LLM serving. To fill this gap, we profile the behavior of GPU servers under varying cooling conditions and AI workloads, and develop a joint cooling and computing modeling approach for AI datacenters. Built upon these workload and thermal-dynamics models, we propose a novel hierarchical control framework that co-optimizes computing and thermal management by identifying the optimal GPU parallelism, frequency (DVFS), and cooling control knobs. Using real Azure inference traces and detailed GPU profiling, our approach balances serving latency against thermal constraints while significantly improving the energy efficiency of AI datacenters.
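As a minimal sketch of the co-optimization described above (the notation here is assumed for exposition and is not necessarily the paper's own formulation), the hierarchical controller can be viewed as choosing GPU parallelism $p$, DVFS frequency $f$, and cooling setpoints $c$ to minimize total compute and cooling energy subject to latency and thermal constraints:
\[
\min_{p,\, f,\, c} \; E_{\mathrm{IT}}(p, f) + E_{\mathrm{cool}}(c)
\quad \text{s.t.} \quad
L(p, f) \le L_{\mathrm{SLO}}, \qquad
T_{\mathrm{GPU}}(p, f, c) \le T_{\mathrm{max}},
\]
where $E_{\mathrm{IT}}$ and $E_{\mathrm{cool}}$ denote IT and cooling energy, $L_{\mathrm{SLO}}$ a serving-latency target, and $T_{\mathrm{max}}$ the GPU thermal limit; all symbols are illustrative placeholders rather than the paper's definitions.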