Driven by the wide adoption of deep neural networks (DNNs) across application domains, multi-tenancy execution, in which multiple DNNs are deployed simultaneously on the same hardware, has been proposed to satisfy the latency requirements of different applications while improving overall system utilization. However, multi-tenancy execution can lead to undesired system-level resource contention, degrading quality of service (QoS) for latency-critical applications. To address this challenge, we propose MoCA, an adaptive multi-tenancy system for DNN accelerators. Unlike existing solutions that focus on partitioning compute resources, MoCA dynamically manages the shared memory resources of co-located applications to meet their QoS targets. Specifically, MoCA leverages the regularities in both DNN operators and accelerators to dynamically modulate each application's memory access rate based on its latency target and user-defined priority, so that co-located applications get the resources they demand without significantly starving their co-runners. We demonstrate that, compared to prior work, MoCA improves the service-level agreement (SLA) satisfaction rate by up to 3.9x (1.8x on average), system throughput by up to 2.3x (1.7x on average), and fairness by up to 1.3x (1.2x on average).
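The idea of modulating memory access rates by latency target and priority can be illustrated with a toy model. The abstract does not specify MoCA's actual policy, so the sketch below is only an assumption-laden illustration: the `Tenant` fields, the `urgency` score, and the weighted water-filling rule in `allocate_bandwidth` are all hypothetical names and choices made for this example, not the paper's mechanism.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    # All fields are hypothetical; they are not taken from the MoCA paper.
    name: str
    demand: float            # bandwidth the tenant would use unthrottled (GB/s)
    priority: float          # user-defined weight, larger means more important
    est_remaining: float     # estimated remaining run time if unthrottled (ms)
    time_to_deadline: float  # time left until the latency target (ms)


def urgency(t: Tenant) -> float:
    """Values near or above 1 mean the tenant has little or no slack left."""
    return t.est_remaining / max(t.time_to_deadline, 1e-9)


def allocate_bandwidth(tenants, total_bw):
    """Split total_bw among tenants by priority * urgency, capped at demand.

    Weighted water-filling: a tenant whose proportional share covers its
    demand is granted exactly its demand, and the leftover bandwidth is
    redistributed among the remaining tenants.
    """
    alloc = {t.name: 0.0 for t in tenants}
    active = list(tenants)
    remaining = total_bw
    while active and remaining > 1e-12:
        weights = {t.name: t.priority * urgency(t) for t in active}
        wsum = sum(weights.values())
        if wsum <= 0:
            break
        # Tenants whose proportional share would meet or exceed their demand.
        capped = [t for t in active
                  if remaining * weights[t.name] / wsum >= t.demand]
        if not capped:
            # Contended case: everyone left is throttled below its demand.
            for t in active:
                alloc[t.name] = remaining * weights[t.name] / wsum
            break
        for t in capped:
            alloc[t.name] = t.demand
            remaining -= t.demand
        active = [t for t in active if t not in capped]
    return alloc


tenants = [
    Tenant("A", demand=10.0, priority=1.0, est_remaining=5.0, time_to_deadline=10.0),
    Tenant("B", demand=10.0, priority=2.0, est_remaining=8.0, time_to_deadline=8.0),
]
contended = allocate_bandwidth(tenants, total_bw=16.0)
uncontended = allocate_bandwidth(tenants, total_bw=32.0)
print(contended, uncontended)
```

In this toy setting, the high-priority tenant with no slack ("B") receives its full demand under contention while the tenant with slack ("A") is throttled to the leftover bandwidth. When enough bandwidth is available, neither tenant is throttled.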