Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters

Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the former excel with regular dataflow at the cost of rigid architectures and complex programming models, the latter are versatile and easy to program but require explicit dataflow management and synchronization. This work aims to enable efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient RISC-V cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead RISC-V ISA extensions for efficient systolic execution, namely Xqueue and Queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several DSP kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22 nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
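To make the idea of queue-connected PEs concrete, the following is a minimal software sketch, not the MemPool implementation. It emulates a linear systolic chain in plain C: each queue is a small ring buffer standing in for a shared-memory-mapped queue, and each PE pops a partial sum, adds its own product, and pushes the result to its neighbor. The names `queue_t`, `queue_push`, `queue_pop`, `systolic_matvec`, `NUM_PES`, and `QUEUE_CAP` are all hypothetical and chosen for illustration only. With Xqueue, each push or pop would presumably collapse to a single instruction; with QLRs, the explicit calls would disappear because the queue accesses become implicit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_PES   4  /* hypothetical chain length, far below MemPool's 256 PEs */
#define QUEUE_CAP 2  /* hypothetical queue depth */

/* Software stand-in for one shared-memory-mapped queue (ring buffer). */
typedef struct {
    int32_t  data[QUEUE_CAP];
    uint32_t head, tail, count;
} queue_t;

/* Returns false when the queue is full, i.e. the producer must stall. */
static bool queue_push(queue_t *q, int32_t v) {
    if (q->count == QUEUE_CAP) return false;
    q->data[q->tail] = v;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    return true;
}

/* Returns false when the queue is empty, i.e. the consumer must stall. */
static bool queue_pop(queue_t *q, int32_t *v) {
    if (q->count == 0) return false;
    *v = q->data[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return true;
}

/* Computes y = W * x on an emulated chain of NUM_PES PEs.
 * PE i holds x[i]; one partial-sum token per row flows through
 * queues q[0] .. q[NUM_PES], and PE i adds W[row][i] * x[i] to it. */
void systolic_matvec(int rows, const int32_t W[][NUM_PES],
                     const int32_t x[NUM_PES], int32_t *y) {
    queue_t q[NUM_PES + 1];
    int row_at[NUM_PES];  /* index of the next row each PE will process */
    int fed = 0, done = 0;

    memset(q, 0, sizeof q);
    memset(row_at, 0, sizeof row_at);

    while (done < rows) {
        int32_t out;

        /* Drain one finished result from the end of the chain. */
        if (queue_pop(&q[NUM_PES], &out)) y[done++] = out;

        /* Visit PEs last-to-first so a token advances one stage per step. */
        for (int i = NUM_PES - 1; i >= 0; i--) {
            int32_t acc;
            /* Fire only if the output has space and an input is available. */
            if (q[i + 1].count < QUEUE_CAP && queue_pop(&q[i], &acc)) {
                acc += W[row_at[i]][i] * x[i];
                row_at[i]++;
                bool ok = queue_push(&q[i + 1], acc);
                assert(ok);
                (void)ok;
            }
        }

        /* Inject a zero partial sum for the next row, if there is room. */
        if (fed < rows && queue_push(&q[0], 0)) fed++;
    }
}
```

In this sketch every PE spends instructions on explicit queue management (full/empty checks, pointer updates), which is the kind of overhead the abstract says hardware queue support is meant to remove. The emulation is sequential and lockstep; it illustrates the dataflow only and says nothing about timing or energy.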
