Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system-level efficiency and hardware utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform that jointly demonstrates high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM consistently achieves hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the state-of-the-art open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup in normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.
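The programmable strided memory access mentioned above can be illustrated with a minimal address-generation sketch. This is not the OpenGeMM ISA; the parameter names (`base`, `row_stride`, `col_stride`) are illustrative assumptions showing how a 2D tile of a matrix can be walked in scratchpad memory with configurable strides:

```python
def strided_addresses(base, rows, cols, row_stride, col_stride):
    """Generate byte addresses for a rows x cols tile, row-major order.

    base        -- starting address of the tile in scratchpad memory
    row_stride  -- byte offset between consecutive rows
    col_stride  -- byte offset between consecutive elements in a row
    """
    return [base + r * row_stride + c * col_stride
            for r in range(rows)
            for c in range(cols)]

# Example: a 2x3 tile of 4-byte elements inside a matrix whose
# full row occupies 16 bytes (so rows are 16 bytes apart).
addrs = strided_addresses(base=0, rows=2, cols=3, row_stride=16, col_stride=4)
# -> [0, 4, 8, 16, 20, 24]
```

Programming strides like these lets the accelerator fetch sub-tiles of a larger matrix directly, without the host CPU first copying data into a contiguous layout.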