Recent breakthroughs in large language models (LLMs) have demonstrated superior capability across major industries and stimulated multi-hundred-billion-dollar investment in AI-centric data centers over the next 3-5 years. This, in turn, raises growing concerns about sustainability and AI-related energy usage. However, there is a largely overlooked issue as challenging and critical as AI model and infrastructure efficiency: the disruptive dynamic power consumption behaviour. With fast, transient dynamics, AI infrastructure features ultra-low inertia, sharp power surges and dips, and a significant peak-idle power ratio. The power scale ranges from several hundred watts to megawatts, and even to gigawatts. These unprecedented characteristics make AI a unique load and pose threats to power grid reliability and resilience. To reveal this hidden problem, this paper examines the scale of AI power consumption, analyzes AI transient behaviour in various scenarios, develops high-level mathematical models to depict AI workload behaviour, and discusses the multifaceted challenges and opportunities they potentially bring to existing power grids. Given the rapidly evolving machine learning (ML) and AI technologies, this work emphasizes the critical need for interdisciplinary approaches to ensure reliable and sustainable AI infrastructure development, and provides a starting point for researchers and practitioners to tackle such challenges.