Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers

Source: arXiv. Pre-print version submitted to Elsevier's Future Generation Computer Systems journal. The associated open-source release is available at https://github.com/pulp-platform/pulp-trainlib

Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units (MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep Neural Network (DNN) models in future TinyML applications. This paper tackles this challenge by introducing a novel reduced precision optimization technique for ODL primitives on MCU-class devices, leveraging state-of-the-art advancements in RISC-V RV32 architectures with support for vectorized 16-bit floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our approach for the Forward and Backward steps of the Back-Propagation training algorithm is composed of specialized shape transform operators and Matrix Multiplication (MM) kernels, accelerated with parallelization and loop unrolling. When evaluated on a single training step of a 2D Convolution layer, the SIMD-optimized FP16 primitives are up to 1.72$\times$ faster than the FP32 baseline on a RISC-V-based 8+1-core MCU. Average computing efficiencies of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81 MAC/clk are measured for the end-to-end training tasks of a ResNet8 and a DS-CNN for Image Classification and Keyword Spotting, respectively, requiring 17.1 ms and 6.4 ms on the target platform to compute a training step on a single sample. Overall, our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs and outperforms previous FP32 parallel implementations by 1.6$\times$ on a Continual Learning setup.
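To make the "shape transform plus MM" decomposition concrete, the following is a minimal sketch of one training step of a 2D Convolution layer, written in plain Python for readability. It is not the pulp-trainlib code: all function names are hypothetical, the layer is assumed to use stride 1 and no padding, FP16 arithmetic is only emulated by rounding each multiply and add through the `struct` half-precision format, and the parallelization and loop unrolling mentioned above are omitted.

```python
import struct

def to_fp16(v):
    """Round a float to the nearest IEEE 754 half-precision (FP16) value.
    Raises OverflowError if |v| exceeds the FP16 range."""
    return struct.unpack("e", struct.pack("e", v))[0]

def matmul(a, b, rnd=float):
    """Naive (n x k) @ (k x m) product; `rnd` is applied to every multiply and add."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc = rnd(acc + rnd(a[i][p] * b[p][j]))
            out[i][j] = acc
    return out

def transpose(a):
    return [list(row) for row in zip(*a)]

def im2col(x, k):
    """Shape transform: unfold a C x H x W input into a (C*k*k) x (Ho*Wo) matrix."""
    h, w = len(x[0]), len(x[0][0])
    ho, wo = h - k + 1, w - k + 1
    return [[x[ch][i + ki][j + kj] for i in range(ho) for j in range(wo)]
            for ch in range(len(x)) for ki in range(k) for kj in range(k)]

def col2im(cols, c, h, w, k):
    """Inverse shape transform: scatter-add a (C*k*k) x (Ho*Wo) matrix into C x H x W.
    The scatter-add itself is done in ordinary Python floats (not rounded)."""
    ho, wo = h - k + 1, w - k + 1
    dx = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for ch in range(c):
        for ki in range(k):
            for kj in range(k):
                row = cols[(ch * k + ki) * k + kj]
                for i in range(ho):
                    for j in range(wo):
                        dx[ch][i + ki][j + kj] += row[i * wo + j]
    return dx

def conv2d_train_step(x, w, dy, k, rnd=float):
    """Forward output, weight gradient, and input gradient, each as one MM.
    x: C x H x W, w: Cout x (C*k*k), dy: Cout x (Ho*Wo)."""
    c, h, wd = len(x), len(x[0]), len(x[0][0])
    cols = im2col(x, k)
    y = matmul(w, cols, rnd)                     # forward
    dw = matmul(dy, transpose(cols), rnd)        # weight gradient
    dcols = matmul(transpose(w), dy, rnd)        # input gradient, still unfolded
    return y, dw, col2im(dcols, c, h, wd, k)

# Example: one 3x3 input channel, one 2x2 filter, emulated FP16 arithmetic.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
w = [[1, 0, 0, 1]]
dy = [[1, 1, 1, 1]]
y, dw, dx = conv2d_train_step(x, w, dy, 2, rnd=to_fp16)
```

Passing `rnd=float` instead of `rnd=to_fp16` gives an unrounded reference that can be used to inspect the rounding error introduced by the reduced precision on larger inputs.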
