Advances in GPU compute throughput and memory capacity brings significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application characteristics may result in imbalanced utilization. Multi-Instance GPU (MIG) is a promising approach to improve utilization by partitioning GPU compute and memory resources into fixed-size slices with isolation. Yet, its effectiveness and limitations in supporting HPC workloads remain an open question. We present a comprehensive system-level characterization of different GPU sharing options using real-world scientific, AI, and data analytics applications, including NekRS, LAMMPS, Llama3, and Qiskit. Our analysis reveals that while GPU sharing via MIG can significantly reduce resource underutilization, and enable system-level improvements in throughput and energy, interference still occurs through shared resources, such as power throttling. Our performance-resource scaling results indicate that coarse-grained provisioning for tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent Nvlink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.
翻译:暂无翻译