GPU-enabled Function-as-a-Service for Machine Learning Inference

Function-as-a-Service (FaaS) is emerging as an important cloud computing service model because it can improve the scalability and usability of a wide range of applications, especially machine learning (ML) inference tasks, which require scalable resources and complex software configurations. These inference tasks rely heavily on GPUs to achieve high performance; however, existing FaaS solutions currently lack GPU support. The event-triggered, short-lived nature of functions poses new challenges to enabling GPUs on FaaS: any solution must account for the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that allows ML inference functions to use GPUs efficiently to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across the GPUs in a FaaS cluster. Second, it caches ML models in GPU memory to improve the performance of model inference functions, and it manages GPU memory globally to improve cache utilization. Third, it co-designs GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and that the proposed locality-aware scheduler achieves a speedup of 48x over the default, load-balancing-only scheduler.
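
The idea of combining model caching with locality-aware scheduling can be illustrated with a small sketch. The abstract does not specify the paper's actual algorithm, so everything below is an assumption made for illustration: the `Gpu` class, the LRU eviction policy, the `max_queue` overload threshold, and the `schedule` and `dispatch` helpers are hypothetical names, not part of OpenFaaS or of the proposed system. The sketch prefers a GPU that already holds the requested model (a cache hit avoids a host-to-GPU transfer) and falls back to the least-loaded GPU otherwise.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """Hypothetical GPU with a memory budget and an LRU cache of resident models."""
    name: str
    capacity_mb: int
    queued: int = 0  # requests currently assigned to this GPU
    cache: OrderedDict = field(default_factory=OrderedDict)  # model name -> size in MB

    def used_mb(self) -> int:
        return sum(self.cache.values())

    def load(self, model: str, size_mb: int) -> bool:
        """Return True on a cache hit; on a miss, evict LRU models and load the new one."""
        if size_mb > self.capacity_mb:
            raise ValueError(f"{model} does not fit on {self.name}")
        if model in self.cache:
            self.cache.move_to_end(model)  # mark as most recently used
            return True
        # Assumed policy: evict least recently used models until the new one fits.
        while self.cache and self.used_mb() + size_mb > self.capacity_mb:
            self.cache.popitem(last=False)
        self.cache[model] = size_mb
        return False


def schedule(gpus, model, max_queue=4):
    """Prefer GPUs that cache the model and are not overloaded; else use any GPU."""
    holders = [g for g in gpus if model in g.cache and g.queued < max_queue]
    candidates = holders or gpus
    # Among the candidates, fall back to plain load balancing.
    return min(candidates, key=lambda g: g.queued)


def dispatch(gpus, model, size_mb, max_queue=4):
    """Pick a GPU, make the model resident, and count the request as queued."""
    gpu = schedule(gpus, model, max_queue)
    hit = gpu.load(model, size_mb)
    gpu.queued += 1
    return gpu, hit


if __name__ == "__main__":
    cluster = [Gpu("gpu0", 8000), Gpu("gpu1", 8000)]
    for name, size in [("resnet", 100), ("bert", 400), ("resnet", 100)]:
        gpu, hit = dispatch(cluster, name, size)
        print(name, "->", gpu.name, "hit" if hit else "miss")
```

In this sketch the overload threshold is what lets the scheduler trade locality against parallelism: once a GPU holding the model has `max_queue` pending requests, the request goes to a less loaded GPU and the model is replicated there. A caller would decrement `queued` when a request completes.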
