Large-scale AI model training workloads use thousands of GPUs operating in tightly synchronized loops. During synchronous communication, start-up, shut-down, and checkpointing, GPU power consumption can swing between peak and idle within milliseconds. These large, rapid load swings endanger grid infrastructure because they induce steep power ramp rates, voltage and frequency shifts, and reactive power transients that can damage transformers, converters, and protection equipment. To address this problem, we introduce EasyRider, a power architecture that mitigates power fluctuations at the rack level. EasyRider uses passive components and actively controlled auxiliary energy storage to attenuate rack power swings. A software system continually monitors the energy storage system to maximize its lifetime under frequent charge/discharge cycles. EasyRider keeps rack power variations within grid safety requirements without modifying AI training frameworks or wasting energy. We evaluate EasyRider on a 400 VDC-rated prototype system against published workload traces and our own GPU testbed, demonstrating its effectiveness across heterogeneous power levels and workload power profiles.
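To make the idea of attenuating rack power swings with auxiliary energy storage concrete, the following is a minimal sketch of a ramp-rate limiter. It is not EasyRider's control scheme, which the abstract does not specify. It assumes an idealized, lossless storage element with unlimited capacity, and it ignores the passive components. The function name `smooth_rack_power`, the ramp limit, and the square-wave trace are all hypothetical values chosen for illustration.

```python
def smooth_rack_power(load_w, dt_s, max_ramp_w_per_s):
    """Ramp-limit the power drawn from the grid for a rack load trace.

    Illustrative assumptions (not taken from the paper): ideal, lossless
    storage with unlimited capacity and no passive filtering stage.

    Returns (grid_w, storage_w, energy_j), where
      grid_w    follows load_w but moves at most max_ramp_w_per_s * dt_s per step,
      storage_w = load_w - grid_w (positive: discharging, negative: charging),
      energy_j  is the cumulative net energy delivered by the storage.
    """
    if not load_w:
        return [], [], []

    max_step = max_ramp_w_per_s * dt_s
    grid_w = [load_w[0]]
    for p in load_w[1:]:
        # Clamp the per-step change seen by the grid to the ramp limit.
        delta = max(-max_step, min(max_step, p - grid_w[-1]))
        grid_w.append(grid_w[-1] + delta)

    # The storage covers whatever the ramp-limited grid draw does not.
    storage_w = [l - g for l, g in zip(load_w, grid_w)]

    energy_j = []
    total = 0.0
    for s in storage_w:
        total += s * dt_s
        energy_j.append(total)
    return grid_w, storage_w, energy_j


# Hypothetical square-wave trace: 100 kW peak, 10 kW idle, 1 ms samples.
trace = [100_000.0] * 50 + [10_000.0] * 50 + [100_000.0] * 50
grid, storage, energy = smooth_rack_power(
    trace, dt_s=0.001, max_ramp_w_per_s=1_000_000.0
)

largest_load_step = max(abs(b - a) for a, b in zip(trace, trace[1:]))
largest_grid_step = max(abs(b - a) for a, b in zip(grid, grid[1:]))
print("largest load step (W):", largest_load_step)
print("largest grid step (W):", largest_grid_step)
```

In this sketch the storage absorbs surplus power when the load drops and supplies the shortfall when it rises, so the grid sees only the bounded ramp. A practical design would also have to respect storage capacity, state of charge, and conversion losses, which this example leaves out.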