Modern GPU applications, such as machine learning (ML) workloads, often utilize only a fraction of a GPU, leading to GPU underutilization in cloud environments. Sharing GPUs across multiple applications from different tenants can improve resource utilization and, consequently, cost, energy, and power efficiency. However, GPU sharing creates memory safety concerns because kernels must share a single GPU address space. Existing spatial-sharing mechanisms either lack fault isolation for memory accesses or require static partitioning, which leads to limited deployability or low utilization. In this paper, we present Guardian, a PTX-level bounds-checking approach that provides memory isolation and supports dynamic GPU spatial sharing. Guardian relies on three mechanisms: (1) It divides the common GPU address space into separate partitions for different applications. (2) It intercepts and checks all GPU-related calls at the lowest level, fencing erroneous operations. (3) It instruments all GPU kernels at the PTX level -- which is available even in closed-source GPU libraries -- fencing all kernel memory accesses that fall outside application memory bounds. Guardian's approach is transparent to applications and supports real-life frameworks, such as Caffe and PyTorch, that issue billions of GPU kernels. Our evaluation shows that Guardian's overhead compared to native execution for such frameworks is between 4% and 12%, with an average of 9%.