As research and deployment of AI grow, so does the computational burden of supporting and sustaining that progress. Training or fine-tuning state-of-the-art models in NLP, computer vision, and other fields virtually requires some form of AI hardware acceleration. Recent large language models require considerable resources to train and deploy, resulting in significant energy usage, potential carbon emissions, and massive demand for GPUs and other hardware accelerators. This surge carries large implications for energy sustainability at the HPC/datacenter level. In this paper, we study the aggregate effect of power-capping GPUs on GPU temperature and power draw at a research supercomputing center. With the right amount of power-capping, we show significant decreases in both temperature and power draw, reducing power consumption and potentially improving hardware life-span with minimal impact on job performance. While power-capping reduces power draw by design, the aggregate system-wide effect on overall energy consumption is less clear; for instance, if users notice job performance degradation from GPU power-caps, they may request additional GPU jobs to compensate, negating any energy savings or even increasing energy consumption. To our knowledge, our work is the first to conduct and make available a detailed analysis of the effects of GPU power-capping at the supercomputing scale. We hope our work will inspire HPC centers and datacenters to further explore, evaluate, and communicate the impact of power-capping AI hardware accelerators for more sustainable AI.