With deep neural networks (DNNs) emerging as the backbone of a multitude of computer vision tasks, their adoption in real-world applications continues to broaden. Given the abundance and omnipresence of smart devices in the consumer landscape, "smart ecosystems" are forming in which sensing happens concurrently across devices rather than on each device in isolation. This is shifting the on-device inference paradigm towards deploying centralised neural processing units (NPUs) at the edge, to which multiple devices (e.g. in smart homes or autonomous vehicles) can stream their data for processing at dynamic rates. While this increases the potential for input batching, naive solutions can lead to subpar performance and quality of experience, especially under spiking loads. At the same time, the deployment of dynamic DNNs with stochastic computation graphs (e.g. early-exit (EE) models) introduces a new dimension of dynamic behaviour in such systems. In this work, we propose a novel early-exit-aware scheduling algorithm that allows sample preemption at run time, to account for the dynamicity introduced by both the arrival and early-exiting processes. In addition, we introduce two novel dimensions to the design space of the NPU hardware architecture, namely Fluid Batching and Stackable Processing Elements, which enable run-time adaptability to different batch sizes and significantly improve NPU utilisation even at small batch sizes. Our evaluation shows that the proposed system achieves average improvements of 1.97x and 6.7x over state-of-the-art DNN streaming systems in terms of average latency and tail-latency service-level objective (SLO) satisfaction, respectively.
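To give some intuition for why exit-awareness matters when batching early-exit models, the following is a minimal toy sketch, not the scheduling algorithm or hardware mechanism proposed in this work. It assumes, purely for illustration, that each sample needs a known number of network stages before it exits, that every in-flight sample advances one stage per time step, and that a freed batch slot can be refilled immediately. The function names `static_batching_steps` and `exit_aware_steps` are hypothetical.

```python
from collections import deque


def static_batching_steps(exit_stages, batch_size):
    """Steps needed when a batch holds the accelerator until its slowest sample exits.

    exit_stages[i] is the (assumed known) number of stages sample i runs
    before exiting; every value must be >= 1.
    """
    total = 0
    for start in range(0, len(exit_stages), batch_size):
        # The whole batch is occupied for as long as its deepest sample runs.
        total += max(exit_stages[start:start + batch_size])
    return total


def exit_aware_steps(exit_stages, batch_size):
    """Steps needed when slots freed by early exits are refilled from the queue.

    Simplifying assumption: slots are independent, so samples at different
    depths can share a batch. Real hardware may not allow this for free.
    """
    queue = deque(exit_stages)
    in_flight = []  # remaining stages for each sample currently in the batch
    steps = 0
    while queue or in_flight:
        # Refill any free slots with waiting samples.
        while queue and len(in_flight) < batch_size:
            in_flight.append(queue.popleft())
        steps += 1
        # Advance every sample by one stage; samples on their last stage exit.
        in_flight = [remaining - 1 for remaining in in_flight if remaining > 1]
    return steps


if __name__ == "__main__":
    stages = [1, 3, 1, 1, 3, 1]  # hypothetical exit depths for six samples
    print(static_batching_steps(stages, batch_size=2))
    print(exit_aware_steps(stages, batch_size=2))
```

In this toy model, refilling freed slots never takes more steps than static batching and takes fewer whenever early exits leave slots idle. It deliberately ignores arrival times, preemption cost and SLO deadlines, all of which the system described above has to handle.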