FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference

Serverless computing (FaaS) has been widely used for deep learning (DL) inference because of its ease of deployment and pay-per-use pricing. However, existing FaaS platforms allocate GPUs to DL inference functions in a coarse-grained manner, without spatio-temporal resource multiplexing or isolation, which results in severe GPU under-utilization, high usage costs, and service level objective (SLO) violations. An efficient, SLO-aware GPU-sharing mechanism is therefore needed in serverless computing to make DL inference cost-effective. In this paper, we propose \textbf{FaST-GShare}, an efficient \textit{\textbf{Fa}aS-oriented \textbf{S}patio-\textbf{T}emporal \textbf{G}PU \textbf{Sharing}} architecture for deep learning inference. In this architecture, we introduce the FaST-Manager to limit and isolate spatio-temporal resources for GPU multiplexing. To characterize function performance, we propose the automatic and flexible FaST-Profiler, which profiles function throughput under various resource allocations. Based on the profiling data and the isolation mechanism, we introduce the FaST-Scheduler, which uses heuristic auto-scaling and efficient resource allocation to guarantee function SLOs. The FaST-Scheduler also schedules functions with efficient GPU node selection to maximize GPU usage. Furthermore, model sharing is exploited to mitigate memory contention. Our prototype implementation on the OpenFaaS platform and experiments on an MLPerf-based benchmark show that FaST-GShare can ensure resource isolation and function SLOs. Compared to the time-sharing mechanism, FaST-GShare improves throughput by 3.15x, GPU utilization by 1.34x, and SM (Streaming Multiprocessor) occupancy by 3.13x on average.
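To make the interplay between profiling and auto-scaling concrete, the following is a minimal sketch of how profiled throughput might drive a resource decision. It is not the FaST-Scheduler algorithm: the names `Profile` and `plan_replicas`, the cost model (SM share multiplied by time quota), and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Profile:
    """One hypothetical profiling sample for a function."""
    sm_share: float    # spatial share: fraction of SMs, in (0, 1]
    time_quota: float  # temporal share: fraction of each time window, in (0, 1]
    throughput: float  # measured requests per second (illustrative values)

    @property
    def gpu_cost(self) -> float:
        # Assumed cost model: spatial share times temporal share.
        return self.sm_share * self.time_quota


def plan_replicas(profiles, target_rps):
    """Pick the allocation with the best throughput per unit of GPU,
    then scale out until the target request rate is covered."""
    best = max(profiles, key=lambda p: p.throughput / p.gpu_cost)
    replicas = ceil(target_rps / best.throughput)
    return best, replicas


# Made-up profiling table for a single function.
profiles = [
    Profile(sm_share=0.25, time_quota=0.5, throughput=40.0),
    Profile(sm_share=0.5, time_quota=0.5, throughput=60.0),
    Profile(sm_share=1.0, time_quota=1.0, throughput=100.0),
]

best, replicas = plan_replicas(profiles, target_rps=150)
total_gpu = replicas * best.gpu_cost
print(best, replicas, total_gpu)
```

Under these assumed numbers, four small replicas cover the target rate while consuming half of one GPU, whereas exclusive whole-GPU allocation would need two GPUs. This is the kind of gap that fine-grained sharing aims to close.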
