To mitigate the increasingly common underutilization of computational resources in modern GPUs, spatial sharing methods enable multiple applications to run on the same GPU simultaneously. This work presents a comprehensive evaluation of NVIDIA's primary technologies for achieving that goal: Multi-Process Service (MPS) and Multi-Instance GPU (MIG). Our findings reveal a crucial trade-off between MPS's flexibility and MIG's isolation, and provide key insights for tailoring the co-execution strategy to job profiles. In the most favorable scenarios, MPS improves performance by up to 30% and reduces energy consumption by about 20%, using its provisioning option to avoid resource monopolization. Under memory contention, however, it suffers severe degradation, worsening performance by around 30%. Conversely, MIG's full hardware isolation resolves memory contention, leading to more consistent improvements, but these gains are tempered by higher overhead, and its rigid partitioning scheme can degrade performance in certain cases.