Machine learning (ML) models are increasingly deployed to production, calling for efficient inference serving systems. Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications vary frequently, sharply, and suddenly, which makes it hard to provision hardware correctly. Model cascades are positioned to tackle both challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing fine-grained adjustment to changing request arrival rates. Despite their potential, model cascades have not been used inside an online serving system. Doing so comes with its own set of challenges, including workload adaptation, model replication onto hardware, inference scheduling, and request batching. In this work, we propose CascadeServe, which automates and optimizes end-to-end inference serving with cascades. CascadeServe operates in two phases, offline and online. In the offline phase, the system pre-computes a gear plan that specifies how to serve inferences online. In the online phase, the gear plan allows the system to serve inferences while making near-optimal adaptations to the query load at negligible decision overhead. We find that CascadeServe reduces cost by 2-3x across a wide spectrum of the latency-accuracy space when compared to state-of-the-art baselines on different workloads.
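To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not CascadeServe's implementation. It assumes a two-model cascade in which a cheap model answers when its confidence clears a threshold, and it reduces a "gear" to a single such threshold per load range. The names `Gear`, `run_cascade`, `select_gear`, `GEAR_PLAN`, and the toy models are hypothetical, and the thresholds and load bounds are illustrative values only. Per the abstract, a real gear plan would also have to cover model replication, scheduling, and batching, none of which are modeled here.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A model maps an input to (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass(frozen=True)
class Gear:
    """Hypothetical gear: one cascade configuration for a range of load."""
    max_qps: float    # highest request rate this gear is meant to handle
    threshold: float  # confidence needed to accept the cheap model's answer


def run_cascade(x: str, small: Model, large: Model, threshold: float) -> Tuple[str, str]:
    """Return (label, name of the model that produced it)."""
    label, confidence = small(x)
    if confidence >= threshold:
        return label, "small"   # cheap answer accepted, expensive model skipped
    label, _ = large(x)         # otherwise fall through to the expensive model
    return label, "large"


def select_gear(plan: Sequence[Gear], qps: float) -> Gear:
    """Online step: a table lookup into the plan computed offline.

    Assumes `plan` is sorted by ascending max_qps. Loads above the last
    bound fall back to the last (cheapest) gear.
    """
    bounds = [gear.max_qps for gear in plan]
    index = min(bisect_left(bounds, qps), len(plan) - 1)
    return plan[index]


# Illustrative plan (assumed values): higher load -> lower threshold,
# so fewer requests reach the expensive model.
GEAR_PLAN = [Gear(100, 0.95), Gear(500, 0.80), Gear(2000, 0.50)]


def toy_small(x: str) -> Tuple[str, float]:
    # Stand-in for a cheap model: confident only on short inputs.
    return ("positive", 0.9) if len(x) < 10 else ("positive", 0.6)


def toy_large(x: str) -> Tuple[str, float]:
    # Stand-in for an expensive, more accurate model.
    return ("negative", 0.99)


if __name__ == "__main__":
    for qps in (50, 300, 5000):
        gear = select_gear(GEAR_PLAN, qps)
        print(qps, gear.threshold, run_cascade("short", toy_small, toy_large, gear.threshold))
```

In this sketch the online decision is a single sorted-table lookup, which is one way the decision overhead could stay negligible. The cost of choosing good gears is pushed entirely to the offline phase.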