This work analyzes the main isolation mechanisms available in modern NVIDIA GPUs, namely MPS (Multi-Process Service), MIG (Multi-Instance GPU), and the recently introduced Green Contexts, with the goal of ensuring predictable inference times in safety-critical applications that use deep learning models. The experimental methodology comprises performance tests, an evaluation of the impact of partitioning, and an analysis of temporal isolation between processes, on both the NVIDIA A100 and Jetson Orin platforms. The results show that MIG provides a high level of isolation, while Green Contexts are a promising alternative for edge devices because they enable fine-grained streaming multiprocessor (SM) allocation with low overhead, albeit without memory isolation. The study also identifies current limitations and outlines potential research directions for improving temporal predictability in shared GPUs.