Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices

Deep neural network (DNN) inference is increasingly executed on mobile and embedded platforms because of its advantages in latency, privacy and always-on availability. However, limited computing resources make efficient DNN deployment on these platforms challenging. Although previous work has proposed many hardware accelerators and static model compression methods, multiple applications typically execute concurrently at system runtime and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous work has addressed these challenges through either dynamic neural networks, which contain sub-networks with different performance trade-offs, or runtime hardware resource management. In this thesis, we propose a combined method: a system for DNN performance trade-off management that exploits the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-design novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with the state of the art, our experimental results using ImageNet on the GPU of the Jetson Xavier NX show that our model is 2.4x faster at similar ImageNet Top-1 accuracy, or 5.1% more accurate at similar latency. We also design a hierarchical runtime resource manager that tunes both the dynamic neural networks and dynamic voltage and frequency scaling (DVFS) at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in a single-model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a scenario with two concurrently deployed models.
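To make the idea of jointly tuning a dynamic DNN and DVFS concrete, the following is a minimal sketch, not the thesis's resource manager. It assumes each combination of sub-network and GPU frequency has been profiled offline into an "operating point", and it picks the lowest-energy point that meets a latency budget and an accuracy floor. The names `OperatingPoint`, `PROFILE` and `select_point`, the fallback policy, and every number in the table are hypothetical and chosen only for illustration; none are measurements from this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperatingPoint:
    """One (sub-network, GPU frequency) pair with its profiled cost."""
    subnet: str
    gpu_freq_mhz: int
    latency_ms: float
    energy_mj: float
    top1: float


# Hypothetical offline profile; all values are made up for illustration.
PROFILE: List[OperatingPoint] = [
    OperatingPoint("small", 510, 12.0, 40.0, 70.0),
    OperatingPoint("small", 1100, 6.0, 55.0, 70.0),
    OperatingPoint("large", 510, 30.0, 110.0, 78.0),
    OperatingPoint("large", 1100, 15.0, 140.0, 78.0),
]


def select_point(profile: List[OperatingPoint],
                 latency_budget_ms: float,
                 min_top1: float) -> OperatingPoint:
    """Pick the lowest-energy point that meets both targets.

    Assumed fallback policy: if no point meets both targets, keep the
    deadline and take the most accurate point that fits; if nothing
    fits the deadline, take the fastest point available.
    """
    feasible = [p for p in profile
                if p.latency_ms <= latency_budget_ms and p.top1 >= min_top1]
    if feasible:
        return min(feasible, key=lambda p: p.energy_mj)

    in_time = [p for p in profile if p.latency_ms <= latency_budget_ms]
    if in_time:
        # Highest accuracy first, then lowest energy as a tie-breaker.
        return max(in_time, key=lambda p: (p.top1, -p.energy_mj))

    return min(profile, key=lambda p: p.latency_ms)


if __name__ == "__main__":
    choice = select_point(PROFILE, latency_budget_ms=20.0, min_top1=75.0)
    print(choice.subnet, choice.gpu_freq_mhz)
```

A real manager would re-run a selection like this whenever the application targets or the available hardware change, and would have to account for contention between concurrent models, which this sketch ignores.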
