Recent innovation in large language models (LLMs) and their myriad use cases has rapidly driven up the demand for datacenter GPU compute capacity. Several cloud providers and other enterprises have made substantial plans to expand their datacenters to support these new workloads. Power is one of the key bottleneck resources in datacenters, and LLMs are becoming increasingly power intensive as model sizes grow. In this paper, we show that there is a significant opportunity to oversubscribe power in LLM clusters. Power oversubscription improves the power efficiency of these datacenters, allowing more servers to be deployed per datacenter, and reduces deployment time, since building new datacenters is slow. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations, and we identify the differences between inference and training power consumption patterns. Based on our analysis of these LLMs, we claim that the average and peak power utilization in LLM clusters for inference should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the restricted set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to build a reliable and robust power oversubscription mechanism. We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in the same GPU cluster for inference, with minimal performance loss.