AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU's tiny memory budget, e.g., 128 kB of RAM. However, inference latency must also remain small to meet real-time constraints. One approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We publish an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32), and we show that msf-CNN can perform inference using 50% less RAM than the prior art (MCUNetV2 and StreamNet). msf-CNN thus offers additional flexibility for system designers.
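To make the DAG formulation concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it assumes a hypothetical per-block cost function block_cost (standing in for a real peak-RAM/latency model), treats cut points between CNN layers as DAG nodes, lets an edge (i, j) mean "fuse layers i..j-1 into one patch-based block", and recovers an optimal fusion setting as a minimum-cost path from the network input to its output.

```python
import heapq

def optimal_fusion(num_layers, block_cost):
    """Pick fusion blocks by shortest path on a DAG of cut points.

    Nodes are cut points 0..num_layers; an edge (i, j) fuses layers
    i..j-1 into one patch-based block. `block_cost(i, j)` is a
    placeholder for a real peak-RAM/latency cost model.
    """
    INF = float("inf")
    dist = [INF] * (num_layers + 1)
    pred = [None] * (num_layers + 1)
    dist[0] = 0
    heap = [(0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i] or i == num_layers:
            continue  # stale entry, or already at the output node
        for j in range(i + 1, num_layers + 1):
            nd = d + block_cost(i, j)
            if nd < dist[j]:
                dist[j], pred[j] = nd, i
                heapq.heappush(heap, (nd, j))
    # Walk predecessors back to recover the chosen blocks [(start, end), ...]
    blocks, j = [], num_layers
    while j > 0:
        blocks.append((pred[j], j))
        j = pred[j]
    return dist[num_layers], blocks[::-1]

# Toy cost: fusing more layers shrinks intermediate buffers but adds
# recompute overhead; the numbers are purely illustrative.
cost = lambda i, j: (j - i) ** 2 + 4 / (j - i)
print(optimal_fusion(6, cost))  # -> (18.0, [(0, 2), (2, 4), (4, 6)])
```

Under this formulation, enumerating every contiguous fusion block yields O(n^2) edges for n layers, so an exhaustive shortest-path search over the whole solution space stays tractable on a host machine before the chosen setting is deployed to the MCU.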