The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often inadequate for LLM inference due to its fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs among performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.