SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

The increasing deployment of ML models on the critical path of production applications, both in the datacenter and at the edge, requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the efficient utilization of scarce resources. State-of-the-art systems resolve this tension either by choosing a static point in the latency-accuracy tradeoff space to serve all requests or by loading specific models on the critical path of request serving. In this work, we instead resolve this tension by simultaneously serving the entire range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized operators in weight-shared SuperNetworks. These operators enable SubNetAct to dynamically route requests through the network to meet a latency and accuracy target. SubNetAct requires up to 2.6x lower memory to serve a vastly higher number of models than the prior state of the art. In addition, SubNetAct's near-instantaneous actuation of models unlocks the design space of fine-grained, reactive scheduling policies. We explore the design of one such extremely effective policy, SlackFit, and instantiate both SubNetAct and SlackFit in a real system, SuperServe. SuperServe achieves 4.67% higher accuracy for the same SLO attainment and 2.85x higher SLO attainment for the same accuracy on a trace derived from the real-world Microsoft Azure Functions workload, and automatically yields the best trade-offs on a wide range of extremely bursty synthetic traces.
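
To make the idea of a fine-grained, reactive policy concrete, the following is a minimal illustrative sketch of slack-driven model selection. It is not the SlackFit algorithm, whose details are not given above. It assumes that each subnetwork has a profiled latency and accuracy, and that the scheduler knows the remaining time (slack) before a request's deadline. The names `SubnetProfile`, `pick_subnet`, and `PROFILES`, as well as all numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SubnetProfile:
    """Hypothetical profiled operating point of one subnetwork."""
    name: str
    latency_ms: float  # assumed profiled inference latency
    accuracy: float    # assumed validation accuracy, in percent


def pick_subnet(profiles: Sequence[SubnetProfile],
                slack_ms: float) -> Optional[SubnetProfile]:
    """Return the most accurate subnetwork whose latency fits the slack.

    Returns None when no subnetwork can finish before the deadline.
    """
    feasible = [p for p in profiles if p.latency_ms <= slack_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.accuracy)


# Made-up numbers for illustration only; they are not from the paper.
PROFILES = [
    SubnetProfile("small", latency_ms=5.0, accuracy=73.0),
    SubnetProfile("medium", latency_ms=12.0, accuracy=77.0),
    SubnetProfile("large", latency_ms=30.0, accuracy=80.0),
]

if __name__ == "__main__":
    for slack in (40.0, 15.0, 6.0, 2.0):
        choice = pick_subnet(PROFILES, slack)
        print(slack, choice.name if choice else "no feasible subnetwork")
```

Under these assumptions, a request with ample slack is routed to the most accurate subnetwork, while a request close to its deadline falls back to a faster, less accurate one. Such per-request switching is only practical if changing the active subnetwork is nearly free, which is the property claimed for SubNetAct.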
