Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition

Tiny machine learning (TinyML) represents the paradigm shift towards local, on-device inference under stringent resource constraints. Its primary goal is to integrate intelligence into tiny, low-cost devices under strict resource, energy, and latency constraints. However, the ultra-resource-constrained nature of these devices can increase inference execution time, which is detrimental in latency-critical applications. At the same time, TinyML applications often involve sensitive data. Latency optimization approaches that rely on training samples are therefore infeasible when such data is unavailable, proprietary, or sensitive, which highlights a pressing need for optimization approaches that do not require access to the training dataset and can be applied directly to pre-trained models. Replacing costly multiplications with more hardware-efficient operations, such as shifts and additions, has been proposed as an effective method for reducing inference latency. However, post-training power-of-two (Po2) approaches are scarce and, in many cases, lead to unacceptable accuracy loss. In this work, we propose a framework that applies approximate matrix decomposition to a given CNN in order to optimize hardware implementations subject to strict constraints, without any retraining or fine-tuning. The genetic algorithm-driven framework explores different matrix decompositions and the resulting multiplier-less CNN accelerator designs for FPGA targets. A comprehensive evaluation on different TinyML benchmarks demonstrates the framework's efficacy in generating latency-optimized implementations that satisfy strict accuracy and resource constraints, achieving an average 33% latency improvement with an average accuracy loss of 1.3% compared to typical systolic array-based FPGA accelerators.
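To make the shift-and-add idea concrete, the following minimal Python sketch rounds each weight to the nearest signed power of two and computes a dot product using only shifts and additions. It illustrates the general Po2 technique only, not the framework's approximate matrix decomposition or its genetic search. The function names (`nearest_po2`, `shift_mul`, `po2_dot`) and the choice to round in the log domain are assumptions made for this example.

```python
import math


def nearest_po2(w):
    """Approximate w as sign * 2**exponent (hypothetical helper).

    Rounding is done in the log2 domain, which is one common choice;
    the paper's actual scheme may differ.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))
    return sign, exponent


def shift_mul(x, sign, exponent):
    """Multiply the integer x by sign * 2**exponent without a multiplier.

    A negative exponent becomes an arithmetic right shift, which floors
    the result, so small precision losses are expected.
    """
    if sign == 0:
        return 0
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return shifted if sign > 0 else -shifted


def po2_dot(xs, weights):
    """Dot product of integer activations with Po2-approximated weights."""
    total = 0
    for x, w in zip(xs, weights):
        sign, exponent = nearest_po2(w)
        total += shift_mul(x, sign, exponent)
    return total


if __name__ == "__main__":
    activations = [8, 4]
    weights = [0.5, 3.0]  # 3.0 is approximated by 4 = 2**2
    exact = sum(x * w for x, w in zip(activations, weights))
    print("exact:", exact, "po2:", po2_dot(activations, weights))
```

The gap between the exact and the Po2 result in this toy case shows why naive post-training Po2 rounding can cost accuracy, which is the problem the abstract says the decomposition-based approach is meant to address.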
