Serverless computing has emerged as a pivotal paradigm for deploying Deep Learning (DL) models, offering automatic scaling and cost efficiency. However, the inherent cold start problem in serverless ML inference systems, particularly the time-consuming model loading process, remains a significant bottleneck. Pipelined model loading improves efficiency but still suffers from pipeline stalls caused by sequential layer construction and monolithic weight loading. In this paper, we propose \textit{Cicada}, a novel pipeline optimization framework that coordinates computational, storage, and scheduling resources through three key mechanisms: (1) \textit{MiniLoader}, which reduces layer construction overhead by opportunistically optimizing parameter initialization; (2) \textit{WeightDecoupler}, which decouples weight file processing from layer construction, enabling asynchronous weight retrieval and out-of-order weight application; and (3) \textit{Priority-Aware Scheduler}, which dynamically allocates resources to ensure that high-priority inference tasks are executed promptly. Our experimental results demonstrate that Cicada achieves significant performance improvements over the state-of-the-art PISeL framework. Specifically, Cicada reduces end-to-end inference latency by an average of 61.59\%, with the MiniLoader component contributing the majority of this improvement (53.41\%) and the WeightDecoupler contributing up to 26.17\%. Additionally, Cicada achieves up to a 2.52x improvement in inference pipeline utilization compared to PISeL.
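To make the decoupling idea concrete, the following is a minimal Python sketch (not Cicada's actual code) of out-of-order weight application: layer construction and weight retrieval proceed concurrently, and a weight is applied as soon as both the layer object and its weight tensor are available, in whichever order they arrive. All names here (\texttt{build\_layer}, \texttt{fetch\_weight}, \texttt{layer\_names}) are hypothetical placeholders, not part of Cicada's interface.

\begin{verbatim}
# Minimal sketch (assumed names, not Cicada's implementation):
# layer construction and weight retrieval run concurrently, and each
# weight is applied as soon as both the layer and its tensor exist.
import concurrent.futures
import threading

layer_names = ["conv1", "conv2", "fc"]   # hypothetical model layers

layers, weights = {}, {}                 # shared state guarded by a lock
lock = threading.Lock()

def build_layer(name):
    # Placeholder for constructing a layer object (e.g., an nn.Module
    # whose parameters are not eagerly initialized).
    return f"layer<{name}>"

def fetch_weight(name):
    # Placeholder for asynchronously reading a weight tensor from storage.
    return f"weights<{name}>"

def apply_weight(name):
    # Out-of-order application: fires for whichever (layer, weight)
    # pair becomes complete first.
    print(f"applied {weights[name]} to {layers[name]}")

def on_layer_built(name, layer):
    with lock:
        layers[name] = layer
        if name in weights:
            apply_weight(name)

def on_weight_fetched(name, weight):
    with lock:
        weights[name] = weight
        if name in layers:
            apply_weight(name)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for name in layer_names:
        pool.submit(lambda n=name: on_layer_built(n, build_layer(n)))
        pool.submit(lambda n=name: on_weight_fetched(n, fetch_weight(n)))
\end{verbatim}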