Spatial sharing methods mitigate the increasingly common underutilization of computational resources in modern GPUs by allowing multiple applications to run on the same device simultaneously. This work presents a comprehensive evaluation of NVIDIA's two primary technologies for this purpose: Multi-Process Service (MPS) and Multi-Instance GPU (MIG). Our findings reveal a crucial trade-off between MPS's flexibility and MIG's isolation, and provide insights for tailoring the co-execution strategy to job profiles. In the most favorable scenarios, MPS improves performance by up to 30% and reduces energy consumption by about 20%, provided its provisioning option is used to avoid resource monopolization. Under memory contention, however, MPS suffers severe degradation, worsening performance by around 30%. Conversely, MIG's full hardware isolation resolves memory contention and leads to more consistent improvements, but these gains are tempered by higher overhead, and its rigid partitioning scheme can degrade performance in certain cases.