The deployment of artificial intelligence is increasingly constrained by limited site-level power capacity, which must support both compute systems and non-compute systems (primarily cooling) at all times. Cooling power demand, especially in non-evaporative cooling systems, can increase substantially with ambient temperature in the summer, producing recurring periods of elevated cooling power that often last for multiple hours per day. Maximizing compute capacity under a limited site-level power budget is therefore an important planning and operational challenge. Sizing the compute system conservatively, based on peak cooling power, leaves part of the site-level power capacity underutilized whenever cooling power is below its peak, particularly in cooler months. Sizing the compute system aggressively, based on low cooling power, can cause total site-level power demand to exceed the site-level power capacity on hot summer days. This paper proposes ComputeAmp (Compute Amplifier), a framework that maximizes compute capacity by jointly and dynamically leveraging cooling, battery energy storage, and computing-based adaptation. We discuss the opportunities and limitations of ComputeAmp and illustrate its potential to significantly expand usable compute capacity within local power and water resource limits. We also present a problem formulation for ComputeAmp and highlight several algorithmic and operational challenges.
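The sizing trade-off described in the abstract can be made concrete with a small arithmetic sketch. Everything in it is an illustrative assumption and not taken from the paper: the 100 MW site capacity, the two cooling-to-compute power ratios, the simplification that cooling power is proportional to compute power, and the helper names `compute_capacity` and `site_demand`.

```python
# Illustrative sketch only: all numbers and the proportional cooling model
# are assumptions made for this example, not values from the paper.

def compute_capacity(site_mw: float, cooling_ratio: float) -> float:
    """Largest compute load (MW) whose compute + cooling demand fits in site_mw,
    assuming cooling power = cooling_ratio * compute power."""
    return site_mw / (1.0 + cooling_ratio)


def site_demand(compute_mw: float, cooling_ratio: float) -> float:
    """Total site demand (MW) for a given compute load and cooling ratio."""
    return compute_mw * (1.0 + cooling_ratio)


SITE_MW = 100.0     # hypothetical site-level power capacity
PEAK_RATIO = 0.40   # hypothetical cooling overhead on a hot summer afternoon
LOW_RATIO = 0.10    # hypothetical cooling overhead in a cool month

# Conservative sizing: compute sized so the site fits even at peak cooling.
conservative_mw = compute_capacity(SITE_MW, PEAK_RATIO)
# Aggressive sizing: compute sized as if cooling were always at its low level.
aggressive_mw = compute_capacity(SITE_MW, LOW_RATIO)

# Site capacity left idle by the conservative design during cool periods.
unused_mw = SITE_MW - site_demand(conservative_mw, LOW_RATIO)
# Amount by which the aggressive design exceeds site capacity at peak cooling.
overload_mw = site_demand(aggressive_mw, PEAK_RATIO) - SITE_MW

print(f"conservative compute: {conservative_mw:.2f} MW, unused when cool: {unused_mw:.2f} MW")
print(f"aggressive compute:   {aggressive_mw:.2f} MW, overload when hot:  {overload_mw:.2f} MW")
```

Under these assumed numbers, conservative sizing supports about 71.4 MW of compute and leaves roughly 21.4 MW of site capacity idle in cool periods, while aggressive sizing supports about 90.9 MW but would exceed the site limit by roughly 27.3 MW at peak cooling. That shortfall is the gap that a framework such as ComputeAmp would have to close through cooling, battery energy storage, and computing-based adaptation.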