Deep neural networks (DNNs) have been widely adopted for mobile inference tasks, yet their ever-increasing computational demands hinder deployment on resource-constrained mobile devices. Hybrid deep learning partitions a DNN into two parts and deploys them across the mobile device and a server, aiming to reduce inference latency or prolong the battery life of the mobile device. However, such partitioning produces non-uniform DNN fragments, which are hard to serve efficiently on the server. This paper presents Graft, an efficient inference serving system for hybrid deep learning with latency service-level objective (SLO) guarantees. Our main insight is to mitigate this non-uniformity through DNN re-alignment, which restructures multiple heterogeneous DNN fragments so that they share layers. To fully exploit the potential of DNN re-alignment, Graft employs fine-grained GPU resource sharing. Building on this, we propose efficient algorithms for merging, grouping, and re-aligning DNN fragments that maximize request batching opportunities and minimize resource consumption while guaranteeing the inference latency SLO. We implement a Graft prototype and perform extensive experiments with five types of widely used DNNs and real-world network traces. Our results show that Graft improves resource efficiency by up to 70% compared with state-of-the-art inference serving systems.
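The intuition behind sharing layers across non-uniform fragments can be illustrated with a toy model. The sketch below is not Graft's algorithm: it ignores SLOs, GPU shares, and grouping, and it assumes that all fragments come from the same sequential DNN and differ only in their partition point. The names `Fragment`, `realign`, and `layer_invocations`, the model label "resnet", and the layer counts are hypothetical. Each fragment keeps a private prefix and joins one shared suffix that starts at the latest partition point, so the suffix layers can run once on a batch instead of once per fragment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Server-side part of a partitioned DNN: layers [start, num_layers)."""
    model: str   # hypothetical model label
    start: int   # hypothetical partition point chosen for one mobile client


def realign(fragments, num_layers):
    """Split each fragment into a private prefix and one shared suffix.

    The shared suffix begins at the latest partition point, so every
    fragment can be batched through layers [shared_start, num_layers).
    """
    shared_start = max(f.start for f in fragments)
    private = [range(f.start, shared_start) for f in fragments]
    shared = range(shared_start, num_layers)
    return private, shared


def layer_invocations(fragments, num_layers, use_realignment):
    """Count layer executions needed to serve one request per fragment."""
    if not use_realignment:
        # Every fragment runs all of its own layers separately.
        return sum(num_layers - f.start for f in fragments)
    private, shared = realign(fragments, num_layers)
    # Private prefixes run per fragment; the shared suffix runs once, batched.
    return sum(len(p) for p in private) + len(shared)


fragments = [Fragment("resnet", 2), Fragment("resnet", 4), Fragment("resnet", 6)]
baseline = layer_invocations(fragments, 10, use_realignment=False)
realigned = layer_invocations(fragments, 10, use_realignment=True)
print(baseline, realigned)  # prints "18 10"
```

In this toy setting, three fragments of a 10-layer model need 18 separate layer executions when served independently and 10 when the last four layers are shared. A real system would also have to weigh the added batching delay against the latency SLO, which this sketch does not model.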