Mobile and IoT applications increasingly rely on deep learning inference to provide intelligence. Inference requests are typically sent to cloud infrastructure over a highly variable wireless network, which makes Service Level Objectives (SLOs) dynamic at the level of individual requests. This paper presents Sponge, a novel deep learning inference serving system that maximizes resource efficiency while guaranteeing dynamic SLOs. Sponge does so by combining in-place vertical scaling, dynamic batching, and request reordering. Specifically, we introduce an Integer Programming formulation of the resource allocation problem, which provides a mathematical model of the relationship between latency, batch size, and resources. We demonstrate the potential of Sponge through a prototype implementation and preliminary experiments, and we discuss future work.
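To make the resource allocation problem concrete, the following is a minimal sketch of how an integer search over CPU cores and batch size might look. It is not the formulation from the paper: the latency model `latency_ms`, its coefficients `alpha`, `beta`, and `gamma`, the function `choose_allocation`, and the bounds `max_cores` and `max_batch` are all hypothetical, chosen only to illustrate the trade-off between latency, batch size, and resources. A real system would fit the latency model from profiling data and might use an integer programming solver instead of exhaustive search.

```python
from itertools import product


def latency_ms(batch, cores, alpha=8.0, beta=20.0, gamma=5.0):
    """Hypothetical latency model (an assumption, not taken from the paper).

    Processing work grows linearly with batch size and is divided across
    cores; gamma is a fixed overhead that does not scale with cores.
    """
    return (alpha * batch + beta) / cores + gamma


def choose_allocation(slo_ms, arrival_rps, max_cores=16, max_batch=16):
    """Return the (cores, batch) pair that uses the fewest cores while
    meeting the latency SLO and sustaining the arrival rate.

    Returns None when no pair within the bounds is feasible.
    """
    # product() yields pairs in ascending (cores, batch) order, so the
    # first feasible pair uses the fewest cores, and among those the
    # smallest batch size.
    for cores, batch in product(range(1, max_cores + 1),
                                range(1, max_batch + 1)):
        lat = latency_ms(batch, cores)
        throughput_rps = batch / (lat / 1000.0)
        if lat <= slo_ms and throughput_rps >= arrival_rps:
            return cores, batch
    return None


if __name__ == "__main__":
    # A loose SLO can be met on one core by batching requests.
    print(choose_allocation(slo_ms=100, arrival_rps=50))
    # A tighter SLO forces vertical scaling to a second core.
    print(choose_allocation(slo_ms=30, arrival_rps=50))
```

Under these assumed coefficients, tightening the SLO from 100 ms to 30 ms moves the chosen allocation from one core with a batch of three to two cores with a batch of one, which is the kind of per-request adjustment that in-place vertical scaling and dynamic batching are meant to enable.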