Serverless computing has gained significant traction for machine learning inference applications, which are often deployed as serverless workflows consisting of multiple CPU and GPU functions with data dependencies. However, existing data-passing solutions for serverless computing primarily rely on host memory for fast data transfer, mandating substantial data movement and incurring considerable I/O overhead. In this paper, we present FaaSTube, a GPU-efficient data-passing system for serverless inference. FaaSTube manages intermediate data within a GPU memory pool to enable direct data exchange between GPU functions. It supports fine-grained bandwidth sharing over PCIe and NVLink, minimizing data-passing latency for both host-to-GPU and GPU-to-GPU transfers while providing performance isolation between functions. Additionally, FaaSTube implements an elastic GPU memory pool that dynamically scales to accommodate varying data-passing demands. Evaluations on real-world applications show that FaaSTube reduces end-to-end latency by up to 90\% and achieves up to 12x higher throughput compared to the state-of-the-art.
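To illustrate the elastic-pool idea mentioned above, here is a minimal toy sketch of a block-based memory pool that grows when demand exceeds capacity and shrinks when blocks sit idle. This is a hypothetical illustration in plain Python (class and method names are our own); the actual FaaSTube pool manages real GPU memory and bandwidth, which is not modeled here.

```python
class ElasticMemoryPool:
    """Toy elastic memory pool: tracks fixed-size blocks, scaling
    capacity up on allocation pressure and down when demand drops.
    (Illustrative sketch only; not FaaSTube's actual implementation.)"""

    def __init__(self, block_size=1 << 20, initial_blocks=4):
        self.block_size = block_size
        self.capacity = initial_blocks      # blocks currently reserved
        self.free = list(range(initial_blocks))
        self.allocated = {}                 # key -> block id
        self._next_id = initial_blocks

    def alloc(self, key):
        # Scale up: reserve one more block when the pool is exhausted.
        if not self.free:
            self.free.append(self._next_id)
            self._next_id += 1
            self.capacity += 1
        block = self.free.pop()
        self.allocated[key] = block
        return block

    def release(self, key):
        # Return a block to the free list for reuse by other functions.
        self.free.append(self.allocated.pop(key))

    def shrink(self, target_blocks):
        # Scale down: release idle blocks until the target is reached
        # (blocks still allocated are never reclaimed).
        while self.capacity > target_blocks and self.free:
            self.free.pop()
            self.capacity -= 1
```

A real GPU pool would additionally pin blocks in device memory and arbitrate PCIe/NVLink bandwidth among concurrent transfers, which this sketch omits.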