Mobile and IoT applications increasingly adopt deep learning inference to provide intelligence. Inference requests are typically sent to a cloud infrastructure over a wireless network that is highly variable, leading to the challenge of dynamic Service Level Objectives (SLOs) at the request level. This paper presents Sponge, a novel deep learning inference serving system that maximizes resource efficiency while guaranteeing dynamic SLOs. Sponge achieves its goal by applying in-place vertical scaling, dynamic batching, and request reordering. Specifically, we introduce an Integer Programming formulation to capture the resource allocation problem, providing a mathematical model of the relationship between latency, batch size, and resources. We demonstrate the potential of Sponge through a prototype implementation and preliminary experiments, and discuss future work.
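To make the latency/batch-size/resource relationship concrete, the following is a minimal sketch, not Sponge's actual formulation: it assumes a hypothetical linear processing model `latency(b, c) = ALPHA * b / c + BETA` (both constants are invented for illustration) and finds the smallest integer core allocation meeting a given SLO by brute-force enumeration, standing in for the Integer Programming solve.

```python
# Toy model of the latency / batch-size / resource trade-off.
# ALPHA and BETA are hypothetical constants, not measured values.
ALPHA = 10.0  # assumed per-request work (ms * cores)
BETA = 2.0    # assumed fixed overhead (ms)


def latency_ms(batch: int, cores: int) -> float:
    """Predicted latency for a batch of requests run on `cores` CPU cores."""
    return ALPHA * batch / cores + BETA


def min_cores(batch: int, slo_ms: float, max_cores: int = 64):
    """Smallest integer core count whose predicted latency meets the SLO.

    Returns None if no allocation up to `max_cores` is feasible. A real
    system would solve this jointly with batch size and request order;
    here we enumerate, since the integer search space is tiny.
    """
    for cores in range(1, max_cores + 1):
        if latency_ms(batch, cores) <= slo_ms:
            return cores
    return None


if __name__ == "__main__":
    # Tighter SLOs (smaller slo_ms) demand more cores for the same batch.
    print(min_cores(8, 25.0))  # → 4 under the assumed model
    print(min_cores(8, 50.0))  # → 2 under the assumed model
```

Under this toy model, in-place vertical scaling corresponds to re-running `min_cores` whenever the per-request SLO changes and resizing the container's core allocation accordingly.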