Advances in deep neural networks (DNNs) have significantly contributed to the development of real-time video processing applications. Efficient scheduling of DNN workloads in cloud-hosted inference systems is crucial to minimizing serving costs while meeting application latency constraints. However, existing systems suffer from excessive module latency during request dispatching, low execution throughput during module scheduling, and wasted latency budget during latency splitting for multi-DNN applications, undermining their ability to minimize the serving cost. In this paper, we design a DNN inference system called Harpagon, which minimizes the serving cost under latency constraints with a three-level design. It first maximizes the batch collection rate with a batch-aware request dispatch policy to minimize the module latency. It then maximizes the module throughput with multi-tuple configurations and a proper number of dummy requests. It also carefully splits the end-to-end latency into per-module latency budgets to minimize the total serving cost for multi-DNN applications. Evaluation shows that Harpagon outperforms the state of the art by 1.49 to 2.37 times in serving cost while satisfying the latency objectives. Additionally, compared to the optimal solution found by brute-force search, Harpagon achieves the lower bound of serving cost for 91.5% of workloads with millisecond-level runtime.
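The latency-splitting idea in the abstract can be illustrated with a small sketch. Note this is a hypothetical toy, not Harpagon's actual algorithm or cost model: we assume each module's serving cost falls as its latency budget grows (a larger budget admits larger batches), and we brute-force the discretized two-way splits of the end-to-end budget, mirroring the brute-force baseline the paper compares against.

```python
# Hypothetical sketch of per-module latency-budget splitting.
# Assumption (not from the paper): a module's cost is inversely
# proportional to its latency budget, since a larger budget allows
# larger batches and thus a lower per-request cost.

def module_cost(budget_ms: float) -> float:
    """Assumed per-module serving cost for a given latency budget."""
    return 100.0 / budget_ms

def best_split(total_ms: float, step_ms: float = 1.0):
    """Brute-force search over discretized two-way splits of the
    end-to-end latency budget, returning (cost, budget1, budget2)."""
    best = None
    b1 = step_ms
    while b1 < total_ms:
        b2 = total_ms - b1
        cost = module_cost(b1) + module_cost(b2)
        if best is None or cost < best[0]:
            best = (cost, b1, b2)
        b1 += step_ms
    return best

cost, b1, b2 = best_split(100.0)
print(b1, b2, cost)  # symmetric cost model -> an even 50/50 split wins
```

With this symmetric toy cost model the optimum is an even split; Harpagon's contribution is deriving such splits efficiently (millisecond-level runtime) rather than by exhaustive search.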