The need to execute Deep Neural Networks (DNNs) with low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chip (SoCs) that integrate a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grained mapping across the on-chip accelerators, splitting individual layers and executing the parts in parallel, to reduce inference energy consumption or latency while taking each accelerator's quantization precision into account to maintain accuracy. We search for Pareto-optimal networks in the accuracy vs. energy or accuracy vs. latency space for three popular dataset/DNN pairs and deploy them on the DIANA heterogeneous ultra-low-power edge AI SoC. ODiMO reduces energy/latency by up to 33%/31% with a limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings.
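To make the idea of splitting a single layer across accelerators concrete, the following is a minimal sketch under simplifying assumptions, not ODiMO's actual algorithm. It assumes a layer's output channels are divided between two accelerators that run in parallel, so latency is the maximum of the two parts and energy is their sum. The `Accelerator` class, the per-channel cost fields, and the numbers in the usage note are hypothetical and chosen only for illustration; the sketch also ignores the accuracy impact of each accelerator's quantization precision, which the abstract says ODiMO takes into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical linear cost model for one accelerator (illustrative only)."""
    name: str
    latency_per_channel: float  # assumed latency cost per output channel
    energy_per_channel: float   # assumed energy cost per output channel


def split_cost(n_channels, n_on_first, first, second):
    """Return (latency, energy) when `n_on_first` output channels run on
    `first` and the remaining channels run on `second`, in parallel."""
    if not 0 <= n_on_first <= n_channels:
        raise ValueError("n_on_first must be between 0 and n_channels")
    n_on_second = n_channels - n_on_first
    # Parallel execution: the slower part determines the layer latency.
    latency = max(n_on_first * first.latency_per_channel,
                  n_on_second * second.latency_per_channel)
    # Both parts consume energy, so the contributions add up.
    energy = (n_on_first * first.energy_per_channel
              + n_on_second * second.energy_per_channel)
    return latency, energy


def best_split(n_channels, first, second, objective="latency"):
    """Exhaustively pick the channel count on `first` that minimizes the
    chosen objective ("latency" or "energy")."""
    if objective not in ("latency", "energy"):
        raise ValueError("objective must be 'latency' or 'energy'")
    index = 0 if objective == "latency" else 1
    return min(range(n_channels + 1),
               key=lambda k: split_cost(n_channels, k, first, second)[index])
```

For example, with `acc_a = Accelerator("acc_a", 4.0, 2.0)` and `acc_b = Accelerator("acc_b", 1.0, 0.5)`, calling `best_split(10, acc_a, acc_b, "latency")` balances the two parts, while `best_split(10, acc_a, acc_b, "energy")` places every channel on the cheaper accelerator. A real mapping would also have to weigh the accuracy cost of running channels at a lower precision, which is why the abstract frames the result as a Pareto front rather than a single optimum.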