GPUs are vastly underutilized, even when running resource-intensive AI applications, because the GPU kernels within each job have diverse resource profiles that can saturate some parts of a device while leaving other parts idle. Colocating applications is known to improve GPU utilization, but it is not common practice because workload interference makes it difficult to provide predictable performance. Providing predictable performance guarantees requires a deep understanding of how applications contend for shared GPU resources such as block schedulers, compute units, L1/L2 caches, and memory bandwidth. We study the key types of GPU resource interference and develop a methodology to quantify a workload's sensitivity to each type. We discuss how this methodology can serve as the foundation for GPU schedulers that enforce strict performance guarantees, and how application developers can design GPU kernels with colocation in mind to improve efficiency.