RedMule: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration

The increasing interest in TinyML, i.e., near-sensor machine learning on power budgets of a few tens of mW, is currently pushing toward enabling TinyML-class training as opposed to inference only. Current training algorithms, based on various forms of error and gradient backpropagation, rely on floating-point matrix operations to meet the precision and dynamic range requirements. So far, the energy and power cost of these operations has been considered too high for TinyML scenarios. This paper addresses the open challenge of near-sensor training on a power budget of a few mW and presents RedMulE (Reduced-Precision Matrix Multiplication Engine), a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops) acceleration, supporting FP16 as well as hybrid FP8 formats with {sign, exponent, mantissa} = {1,4,3} and {1,5,2}. We integrate RedMulE into a Parallel Ultra-Low-Power (PULP) cluster containing eight energy-efficient RISC-V cores sharing a tightly-coupled data memory and implement the resulting system in a 22 nm technology. At its best efficiency point (@ 470 MHz, 0.65 V), the RedMulE-augmented PULP cluster achieves 755 GFLOPS/W and 920 GFLOPS/W during regular General Matrix-Matrix Multiplication (GEMM), and up to 1.19 TFLOPS/W and 1.67 TFLOPS/W when executing GEMM-Ops, for FP16 and FP8 input/output tensors, respectively. At its best performance point (@ 613 MHz, 0.8 V), RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements while consuming less than 60 mW on average. This enables on-device training of deep learning models in TinyML application scenarios while retaining the flexibility to tackle other classes of common linear algebra problems efficiently.
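The two FP8 layouts named in the abstract, {1,4,3} and {1,5,2}, can be illustrated with a small bit-pattern decoder. This is a minimal sketch and not a description of RedMulE's datapath: the function name `decode_minifloat`, the IEEE-754-style exponent bias of 2^(e-1) - 1, the subnormal handling, and the treatment of an all-ones exponent as an ordinary value are all assumptions, since the abstract specifies only the field widths.

```python
def decode_minifloat(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a {1, exp_bits, man_bits} floating-point bit pattern.

    Assumptions (not taken from the abstract): IEEE-754-style bias of
    2**(exp_bits - 1) - 1, gradual underflow via subnormals, and no
    special inf/NaN encodings (an all-ones exponent is decoded as a
    normal number).
    """
    total_bits = 1 + exp_bits + man_bits
    if not 0 <= bits < (1 << total_bits):
        raise ValueError(f"pattern does not fit in {total_bits} bits")

    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    fraction = mantissa / (1 << man_bits)

    if exponent == 0:
        # Subnormal: no implicit leading one, fixed exponent of 1 - bias.
        return sign * fraction * 2.0 ** (1 - bias)
    # Normal: implicit leading one.
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)


# The same value, 1.0, encoded in each of the two FP8 layouts.
one_e4m3 = decode_minifloat(0b0_0111_000, exp_bits=4, man_bits=3)
one_e5m2 = decode_minifloat(0b0_01111_00, exp_bits=5, man_bits=2)
```

Under these assumptions, the {1,4,3} layout trades dynamic range for one extra bit of precision, while {1,5,2} does the opposite, which is the usual motivation for offering both formats side by side.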
