Modern GPUs such as the NVIDIA Ampere series (A30, A100) and the Hopper series (H100, H200) offer both performance and security isolation features. They also support substantial concurrency, but exploiting it is challenging because of the complex constraints on partitioning the chip. In this work, we develop partitioning and scheduling schemes for a variety of workloads, ranging from scientific computing to modern ML workloads, including LLMs. Our schemes combine dynamic memory estimation, partition fusion, and partition fission; we also support process restart to recover from out-of-memory errors, with early restart as an optimization. This approach yields up to 6.20x throughput and 5.93x energy improvements for general workloads, and 1.59x throughput and 1.12x energy improvements for ML workloads on an A100 GPU. Applying the technique to LLM workloads yields further gains, including up to a 1.43x throughput improvement and 1.11x energy savings.
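To make the scheduling loop concrete, the sketch below shows one plausible shape of the OOM-recovery path described above: estimate a job's memory footprint, place it on the smallest MIG partition that fits, and on failure restart it on the next larger partition. This is a hypothetical illustration, not the paper's implementation; the MIG_PROFILES catalog, the job dictionary, and both helper functions are assumptions introduced here, and real placement must also respect the chip's slice-layout constraints.

```python
import os
import subprocess

# Hypothetical catalog of A100 MIG profiles and their memory in GiB.
MIG_PROFILES = [("1g.5gb", 5), ("2g.10gb", 10), ("3g.20gb", 20), ("7g.40gb", 40)]

def estimate_memory_gib(job: dict) -> float:
    """Stand-in for dynamic memory estimation: here we just read a
    caller-supplied estimate; a real system would profile the workload."""
    return job["estimated_gib"]

def run_on_partition(job: dict, mig_uuid: str) -> bool:
    """Run the job pinned to one MIG instance via CUDA_VISIBLE_DEVICES.
    A nonzero exit is treated as a (possible) out-of-memory failure."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": mig_uuid}
    proc = subprocess.run(job["cmd"], env=env)
    return proc.returncode == 0

def schedule(job: dict, instances: dict) -> str:
    """instances maps profile name -> MIG instance UUID, assumed to have
    been created beforehand (e.g., with `nvidia-smi mig -cgi ... -C`)."""
    need = estimate_memory_gib(job)
    for profile, mem_gib in MIG_PROFILES:
        if mem_gib < need or profile not in instances:
            continue  # skip partitions the estimate says are too small
        if run_on_partition(job, instances[profile]):
            return profile  # done: smallest fitting partition succeeded
        # Failure (e.g., OOM because the estimate was too low): fall
        # through and restart the process on the next larger partition.
    raise RuntimeError("job did not complete on any available partition")
```

One design note: walking the profiles from smallest to largest keeps small slices free for other jobs (favoring concurrency), at the cost of an occasional restart when the memory estimate is too optimistic; the early-restart optimization mentioned above would abort such a run before it fails rather than waiting for the OOM.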