Large-scale ML accelerators rely on large numbers of PEs, imposing strict bounds on the area and energy budget of each PE. Prior work demonstrates that limited dual-issue capabilities can be efficiently integrated into Snitch, a lightweight, in-order, open-source RISC-V core, with a geomean IPC boost of 1.6x and a geomean energy-efficiency gain of 1.3x, obtained by concurrently executing integer and FP instructions. Unfortunately, exploiting this required a complex and error-prone low-level programming model (COPIFT). We introduce COPIFTv2, which augments Snitch with lightweight queues enabling direct, fine-grained communication and synchronization between the integer and FP threads. By eliminating the tiling and software-pipelining steps of COPIFT, we remove much of its complexity and software overhead. As a result, COPIFTv2 achieves up to a 1.49x speedup and a 1.47x energy-efficiency gain over COPIFT, and a peak IPC of 1.81. Overall, COPIFTv2 significantly enhances the efficiency and programmability of dual-issue execution on lightweight cores. Our implementation is fully open source, and our performance experiments are reproducible using free software.
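The decoupled integer/FP execution with queue-based synchronization described above can be illustrated with a minimal software analogy. This is only a sketch: Python threads and a bounded `queue.Queue` stand in for COPIFTv2's hardware queues, and all names (`int_thread`, `fp_thread`, the queue depth) are illustrative assumptions, not COPIFTv2's actual API.

```python
# Illustrative analogy only: two software threads connected by a bounded
# queue stand in for the integer and FP pipelines of a dual-issue core.
import threading
import queue

N = 8
work_q = queue.Queue(maxsize=4)  # integer -> FP queue (depth is hypothetical)

data = [float(i) for i in range(N)]
results = []

def int_thread():
    # The integer thread handles control flow and index/address generation,
    # pushing work items into the queue; put() blocks when the queue is full,
    # which is the fine-grained synchronization the queues provide.
    for i in range(N):
        work_q.put(i)
    work_q.put(None)  # sentinel: no more work

def fp_thread():
    # The FP thread pops indices and performs the floating-point work,
    # running concurrently with the integer thread.
    while True:
        i = work_q.get()
        if i is None:
            break
        results.append(data[i] * 2.0)

t0 = threading.Thread(target=int_thread)
t1 = threading.Thread(target=fp_thread)
t0.start(); t1.start()
t0.join(); t1.join()
print(results)
```

Because each thread only blocks on the queue, neither tiling nor explicit software pipelining is needed to interleave the two instruction streams, which is the programmability gain the abstract claims.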