Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

GPU systems increasingly power modern datacenters at scale. Despite their high performance, GPU systems can exhibit performance variation at the node and cluster levels. Such variation can significantly affect both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). In this work, we analyze the performance of a single-node multi-GPU system running LLM training and observe that kernel-level performance variation is highly correlated with concurrent computation and communication (C3), a technique that overlaps computation and communication across GPUs for performance gains. We then reason that thermally induced straggling, coupled with C3, contributes to this performance variation, an effect we call Lit Silicon. Specifically, in a multi-GPU node, thermal imbalance across GPUs can create node-level straggler GPUs (hotter and slower), which in turn slow down the leader GPUs (cooler and faster). Lit Silicon can lead to node-level performance variation and inefficiency, potentially affecting the entire datacenter. We propose analytical performance and power models for Lit Silicon to understand the potential system-level gains. We further design simple detection and mitigation techniques to address the Lit Silicon problem, and evaluate three power management solutions: (1) power optimization under GPU thermal design power, (2) performance optimization under node-level GPU power capping, and (3) performance optimization under node-level CPU power sloshing. We conduct experiments on two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, potentially saving several tens of millions of dollars in electricity costs in datacenters.
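The straggler-and-leader coupling described above can be illustrated with a toy model. This is a minimal sketch, not the analytical performance or power model proposed in this work: it assumes every training step ends at a synchronizing collective, so the step time of the node equals the compute time of its slowest GPU. The function names, the 2% tolerance, and the sample timings are all hypothetical.

```python
def step_time(compute_times):
    """Node-level step time under the assumed barrier: the slowest GPU sets the pace."""
    return max(compute_times)


def idle_fraction(compute_times):
    """Fraction of each step that a GPU spends waiting for the slowest GPU."""
    total = step_time(compute_times)
    return [(total - t) / total for t in compute_times]


def detect_stragglers(compute_times, tolerance=0.02):
    """Indices of GPUs slower than the fastest GPU by more than `tolerance` (hypothetical threshold)."""
    fastest = min(compute_times)
    return [i for i, t in enumerate(compute_times) if t > fastest * (1 + tolerance)]


# Hypothetical per-GPU compute times in seconds; GPU 2 is hotter and 5% slower.
times = [1.00, 1.00, 1.05, 1.00]
node_time = step_time(times)
stragglers = detect_stragglers(times)
waiting = idle_fraction(times)
```

In this toy setting a single slower GPU stretches the step for the whole node, and the cooler GPUs spend part of each step waiting. That waiting time is the slack a power management policy could try to reclaim, for example by shifting power budget toward the straggler.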
