The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area and power constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect loads and non-zero index decompression required by our kernels, obtaining up to 1.9x extra speedup at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to deploy our sparse kernels on complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), with less than 1.5% accuracy drop compared to a dense baseline.
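To make the N:M format concrete, the following is a minimal NumPy sketch (not the paper's actual kernels, which are written for RISC-V MCUs) of 1:M structured sparsity: each group of M consecutive weights keeps a single non-zero value plus a small index of its position, and the dot product then performs an indirect load of the activation selected by the decompressed index. All function names here are illustrative.

```python
import numpy as np

def nm_compress(w, m):
    """Compress a dense weight vector with 1:M structured sparsity:
    for each group of m consecutive weights, keep only the
    largest-magnitude value and its within-group position."""
    groups = w.reshape(-1, m)
    idx = np.abs(groups).argmax(axis=1)            # kept position per group
    vals = groups[np.arange(groups.shape[0]), idx]  # kept value per group
    return vals, idx.astype(np.uint8)              # small indices: m <= 256

def nm_dot(vals, idx, x, m):
    """Sparse dot product: decompress each group's index into an
    absolute offset, then gather (indirect-load) the matching
    activation and multiply-accumulate against the kept weight."""
    offs = np.arange(len(vals)) * m + idx          # absolute activation offsets
    return float(np.dot(vals, x[offs]))            # gather + MAC

rng = np.random.default_rng(0)
w = rng.standard_normal(32)
x = rng.standard_normal(32)
vals, idx = nm_compress(w, 8)   # 1:8 sparsity: 4 non-zeros out of 32
y = nm_dot(vals, idx, x, 8)
```

The index gather (`x[offs]`) and the index decompression are exactly the operations that dominate on a scalar core, which is why a dedicated ISA extension for them pays off.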