Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, which couple instruction processors and hardware accelerators for tensor computations within the same micro-controller unit (MCU), is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting them to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. At the opposite extreme, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, generating code that is general but unoptimized. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy, agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general, retargetable mapping framework enhanced with hardware cost models can compete with, and even outperform, custom toolchains on diverse targets, while requiring only the definition of an abstract hardware model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite, MATCH reduces inference latency by up to 60.88x on DIANA compared to plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce latency by 16.94%. On GAP9, on the same benchmarks, we improve latency by 2.15x compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach, which synergistically exploits the DNN accelerator and the eight-core cluster available on board.