Serverless computing has become increasingly popular for machine learning inference. However, current serverless platforms lack efficient support for GPUs, which limits their ability to deliver low-latency inference. In this paper, we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap takes a holistic approach to system and algorithm design. It keeps models in main memory and swaps them onto GPUs dynamically when requests arrive (i.e., late binding), allowing a large number of inference functions to efficiently share a node's GPUs. FaaSwap combines several techniques to optimize performance, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management. We also develop an interference-aware request scheduling algorithm that allows FaaSwap to meet the latency SLOs of individual inference functions. We have implemented a FaaSwap prototype on a leading commercial serverless platform. Experimental evaluations demonstrate that, with model swapping, FaaSwap can concurrently serve hundreds of functions on a single worker node with 4 V100 GPUs while achieving inference performance comparable to native execution (where each function runs on a dedicated GPU). When deployed on a 6-node production testbed, FaaSwap meets the latency SLOs for over 1,000 functions, the maximum that the testbed can handle concurrently.
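As an illustration of late binding, the following minimal Python sketch models a single GPU whose memory holds only the models needed by recent requests, while all other models are assumed to stay in main memory. It is a toy under stated assumptions, not FaaSwap's memory manager: the class `GpuModelCache`, the least-recently-used eviction policy, and the sizes in MB are all hypothetical, since the abstract does not specify how FaaSwap chooses which models to swap out.

```python
from collections import OrderedDict


class GpuModelCache:
    """Toy model of one GPU's memory holding swapped-in models (sizes in MB).

    Hypothetical illustration only: the LRU policy is an assumption,
    not the eviction strategy described by the FaaSwap authors.
    """

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        # Maps model name -> size; least recently used entry comes first.
        self.resident = OrderedDict()

    def used_mb(self):
        return sum(self.resident.values())

    def serve(self, model, size_mb):
        """Bind `model` to this GPU at request time.

        Returns the list of models swapped out to make room.
        """
        if size_mb > self.capacity_mb:
            raise ValueError("model does not fit on this GPU")
        if model in self.resident:
            # Already on the GPU: just mark it as most recently used.
            self.resident.move_to_end(model)
            return []
        evicted = []
        # Swap out least recently used models until the new one fits.
        while self.used_mb() + size_mb > self.capacity_mb:
            victim, _ = self.resident.popitem(last=False)
            evicted.append(victim)
        self.resident[model] = size_mb
        return evicted


cache = GpuModelCache(capacity_mb=100)
cache.serve("a", 60)
cache.serve("b", 30)
cache.serve("a", 60)              # hit: "a" becomes most recently used
swapped_out = cache.serve("c", 40)  # needs room, so the oldest model leaves
```

In this sketch a request for a resident model costs nothing extra, whereas a request for a non-resident model pays for a swap. That swap cost is the overhead a real system would need to hide, for example through the pipelined model execution mentioned above.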