Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition

Tiny machine learning (TinyML) represents the paradigm shift towards local, on-device inference under stringent resource constraints. The primary goal of TinyML is to integrate intelligence into tiny, low-cost devices under strict resource, energy, and latency constraints. However, the ultra-resource-constrained nature of these devices can increase inference execution time, which can be detrimental in latency-critical applications. At the same time, TinyML applications are often associated with sensitive data. Latency optimization approaches that rely on training samples are therefore infeasible when such data is unavailable, proprietary, or sensitive, which highlights a pressing need for optimization approaches that do not require access to the training dataset and can be applied directly to pre-trained models. Replacing costly multiplications with more hardware-efficient operations, such as shifts and additions, has been proposed as an effective method for reducing inference latency. However, post-training power-of-two (Po2) approaches are scarce and, in many cases, lead to unacceptable accuracy loss. In this work, we propose a framework that applies approximate matrix decomposition to a given CNN in order to optimize hardware implementations subject to strict constraints, without requiring any retraining or fine-tuning steps. The genetic algorithm-driven framework explores different matrix decompositions and the resulting multiplier-less CNN accelerator designs for FPGA targets. A comprehensive evaluation on different TinyML benchmarks demonstrates our framework's efficacy in generating latency-optimized implementations that satisfy strict accuracy and resource constraints, achieving an average 33% latency improvement with an average accuracy loss of 1.3% compared to typical systolic array-based FPGA accelerators.
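
To make the idea of replacing multiplications with shifts and additions concrete, the following is a minimal sketch, not the decomposition method proposed in this work. It assumes integer (already quantized) weights and approximates each weight greedily as a short sum of signed powers of two, so that a product can be computed with shifts and adds only. The function names `po2_terms` and `shift_add_mul` and the `max_terms` parameter are hypothetical and chosen purely for illustration.

```python
def po2_terms(w, max_terms=2):
    """Greedily approximate integer w as a sum of at most max_terms
    signed powers of two. Returns a list of (sign, shift) pairs.
    Illustrative only; this is not the paper's decomposition."""
    terms = []
    residual = w
    for _ in range(max_terms):
        if residual == 0:
            break
        sign = 1 if residual > 0 else -1
        mag = abs(residual)
        lo = mag.bit_length() - 1  # largest power of two <= mag
        # Pick whichever of 2**lo and 2**(lo + 1) is closer to mag.
        if mag - (1 << lo) <= (1 << (lo + 1)) - mag:
            shift = lo
        else:
            shift = lo + 1
        terms.append((sign, shift))
        residual -= sign * (1 << shift)
    return terms


def shift_add_mul(x, terms):
    """Compute x times the approximated weight using only shifts,
    additions, and negations (no multiplier)."""
    acc = 0
    for sign, shift in terms:
        shifted = x << shift
        acc += shifted if sign > 0 else -shifted
    return acc


if __name__ == "__main__":
    terms = po2_terms(7, max_terms=2)   # 7 is represented as 8 - 1
    print(terms, shift_add_mul(5, terms))
```

With `max_terms=2`, the weight 7 is represented exactly as 8 - 1, whereas with `max_terms=1` a weight such as 5 is rounded to 4. This shows in miniature the trade-off between the number of shift-add terms (hardware cost and latency) and approximation error (accuracy loss).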
