NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault resilience: a fault in one process can terminate all co-running processes, which limits its adoption in resilience-critical settings such as multi-tenant GPU clusters. In this work, we design a fault-resilient MPS to address this problem. Our design is guided by insights from a systematic characterization of GPU faults and a deep analysis of their end-to-end processing pipeline. Based on these insights, we design two complementary mechanisms. The first is a fault isolation mechanism for the dominant memory-related faults, which can be fully isolated by software intervention in the open GPU driver kernel module. The second, for other faults whose processing takes place within proprietary software, is a practical fast-recovery mechanism built on virtual-memory-based sharing of GPU-resident state. Our evaluation on different GPUs and workloads shows that these mechanisms handle their corresponding faults effectively with minimal overhead.