RISC-V processors face substantial challenges in deploying multi-precision deep neural networks (DNNs) because of their restricted precision support, constrained throughput, and suboptimal dataflow design. To tackle these challenges, we propose SPEED, a scalable RISC-V vector (RVV) processor that enables efficient multi-precision DNN inference through innovations in customized instructions, hardware architecture, and dataflow mapping. First, dedicated customized RISC-V instructions are proposed based on RVV extensions, giving SPEED fine-grained control over processing precision ranging from 4 to 16 bits. Second, a parameterized multi-precision systolic array unit is incorporated within the scalable module to enhance parallel processing capability and data reuse opportunities. Finally, a mixed multi-precision dataflow strategy, compatible with different convolution kernels and data precisions, is proposed to improve data utilization and computational efficiency. We synthesize SPEED in TSMC 28 nm technology. The experimental results show that SPEED achieves a peak throughput of 287.41 GOPS and an energy efficiency of 1335.79 GOPS/W at 4-bit precision. Moreover, compared to the pioneering open-source vector processor Ara, SPEED improves area efficiency by 2.04$\times$ and 1.63$\times$ at 16-bit and 8-bit precision, respectively, which shows SPEED's significant potential for efficient multi-precision DNN inference.
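To make the idea of precision ranging from 4 to 16 bits concrete, the following is a minimal software sketch of how operands of different widths might be packed into fixed-width words and consumed by a dot product. It is not SPEED's instruction set or datapath: the 64-bit word width and the names `pack`, `unpack`, and `packed_dot` are assumptions chosen only for illustration.

```python
# Illustrative sketch only: the 64-bit word width and all function names are
# assumptions for this example, not details of SPEED's hardware or ISA.

def pack(values, bits, word_bits=64):
    """Pack signed `bits`-wide integers into unsigned `word_bits`-wide words."""
    if word_bits % bits != 0:
        raise ValueError("word width must be a multiple of the element width")
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, v in enumerate(values[start:start + per_word]):
            if not lo <= v <= hi:
                raise ValueError(f"{v} does not fit in {bits} signed bits")
            # Two's-complement encode the value and place it in its slot.
            word |= (v & mask) << (slot * bits)
        words.append(word)
    return words


def unpack(words, bits, count, word_bits=64):
    """Recover `count` signed `bits`-wide integers from packed words."""
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    out = []
    for word in words:
        for slot in range(per_word):
            if len(out) == count:
                return out
            raw = (word >> (slot * bits)) & mask
            # Sign-extend: values with the top bit set are negative.
            out.append(raw - (1 << bits) if raw & sign else raw)
    return out


def packed_dot(a_words, b_words, bits, count):
    """Dot product of two packed operand vectors at the given precision."""
    a = unpack(a_words, bits, count)
    b = unpack(b_words, bits, count)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = [-8, 7, 3, -1]
    b = [2, -3, 4, 5]
    for bits in (4, 8, 16):
        result = packed_dot(pack(a, bits), pack(b, bits), bits, len(a))
        print(bits, 64 // bits, result)
```

Under these assumptions, one 64-bit word holds 16 operands at 4 bits, 8 at 8 bits, and 4 at 16 bits, which is the basic reason lower precision can raise the number of operations performed per word of data moved.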