Deploying neural networks on constrained hardware platforms such as 32-bit microcontrollers is challenging because of the large memory, computing, and energy requirements of the inference process. To tackle these issues, several convolution primitives have been proposed to make the standard convolution more computationally efficient. However, few of these primitives have actually been implemented for 32-bit microcontrollers. In this work, we collect different state-of-the-art convolution primitives and propose an implementation for the ARM Cortex-M processor family using an open-source deployment platform (NNoM). We then carry out experimental characterization tests on these implementations. Our benchmark reveals a linear relationship between theoretical MACs and energy consumption, which shows the advantage of computationally efficient primitives such as shift convolution. We discuss the significant reduction in latency and energy consumption obtained with SIMD instructions and highlight the importance of data reuse in those performance gains. For reproducibility and further experiments, the code and experiments are publicly available.
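As a rough illustration of the "theoretical MACs" the abstract refers to, the sketch below counts multiply-accumulate operations for a few convolution primitives using the standard textbook formulas. It is not taken from the authors' code: the function names, the example layer shape, and the choice of primitives other than shift convolution are assumptions made here for illustration. Shift convolution is modeled as a zero-MAC channel shift followed by a 1x1 pointwise convolution, and bias additions are ignored.

```python
# Theoretical MAC counts for common convolution primitives.
# All names and the example layer shape below are hypothetical,
# chosen only to illustrate the standard formulas.


def standard_conv_macs(h_out, w_out, c_in, c_out, k):
    """Standard convolution: every output value sums over c_in * k * k products."""
    return h_out * w_out * c_in * c_out * k * k


def grouped_conv_macs(h_out, w_out, c_in, c_out, k, groups):
    """Grouped convolution: each output channel only sees c_in / groups inputs."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must be divisible by groups")
    return h_out * w_out * (c_in // groups) * c_out * k * k


def depthwise_separable_conv_macs(h_out, w_out, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = h_out * w_out * c_in * k * k
    pointwise = h_out * w_out * c_in * c_out
    return depthwise + pointwise


def shift_conv_macs(h_out, w_out, c_in, c_out):
    """Shift convolution: the shift itself costs no MACs (assumed here to be
    pure data movement), so only the 1x1 pointwise convolution is counted."""
    return h_out * w_out * c_in * c_out


if __name__ == "__main__":
    # Hypothetical layer: 32x32 output, 16 input channels, 32 output channels, 3x3 kernel.
    shape = dict(h_out=32, w_out=32, c_in=16, c_out=32)
    print("standard           :", standard_conv_macs(**shape, k=3))
    print("grouped (4 groups) :", grouped_conv_macs(**shape, k=3, groups=4))
    print("depthwise separable:", depthwise_separable_conv_macs(**shape, k=3))
    print("shift              :", shift_conv_macs(**shape))
```

For this hypothetical layer, shift convolution needs nine times fewer MACs than the standard 3x3 convolution. Under the linear MAC-to-energy relationship reported in the abstract, a reduction of that kind would be expected to translate into a corresponding energy saving, although the actual slope and offset depend on the measured hardware.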