On-chip DNN inference and training at the extreme edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are a promising way to meet these requirements, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; and a 16-b floating-point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65 nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultra-low-power quantized inference.
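As an illustration of why depthwise convolution is called a low-data-reuse kernel, the following minimal Python sketch implements a reference depthwise convolution and compares how many MACs each input element feeds in the depthwise and standard cases. It is not DARKSIDE's implementation: the function names `depthwise_conv2d` and `macs_per_input_element`, the tensor layout, and the stride-1, no-padding setting are all assumptions made for this example.

```python
def depthwise_conv2d(x, w):
    """Reference depthwise convolution (stride 1, no padding).

    x: input as nested lists, shape [C][H][W]
    w: one K x K filter per channel, shape [C][K][K]
    Each input channel is convolved only with its own filter.
    """
    channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    k = len(w[0])
    out_h, out_w = height - k + 1, width - k + 1
    y = [[[0] * out_w for _ in range(out_h)] for _ in range(channels)]
    for c in range(channels):
        for i in range(out_h):
            for j in range(out_w):
                acc = 0
                for di in range(k):
                    for dj in range(k):
                        acc += x[c][i + di][j + dj] * w[c][di][dj]
                y[c][i][j] = acc
    return y


def macs_per_input_element(c_out, k, depthwise):
    """Approximate number of MACs an interior input element takes part in.

    Standard convolution: every input element is reused by all c_out
    filters, so it feeds k*k*c_out MACs. Depthwise convolution: it is
    used by a single filter, so it feeds only k*k MACs.
    Border effects are ignored (hypothetical simplification).
    """
    return k * k if depthwise else k * k * c_out
```

Under these assumptions, a 3x3 layer with 64 output channels reuses each input element in 576 MACs as a standard convolution but in only 9 MACs as a depthwise convolution, so far more data must be moved per MAC. That is the kind of kernel a dedicated engine targets.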