Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is commonly believed that these tightly coupled clusters do not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE-to-L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all cores sharing a global view of a large, multi-banked L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2% of execution stalls.
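The headline figures above can be related to one another with some quick arithmetic. The sketch below is only a back-of-the-envelope check and rests on assumptions that the abstract does not confirm: that the 229 GOPS and 192 GOPS/W peaks are reached at the same operating point (the "up to" wording leaves open that they come from different workloads), and that all 256 cores contribute equally. The function names are hypothetical and chosen for illustration.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the abstract.
# ASSUMPTION: peak GOPS and peak GOPS/W refer to the same operating point,
# which the abstract does not state explicitly.

NUM_CORES = 256                 # "256 ... Snitch cores"
CLOCK_HZ = 600e6                # "600 MHz"
GATE_DELAYS_PER_CYCLE = 60      # "60 gate delays"
PEAK_GOPS = 229.0               # "up to 229 GOPS"
PEAK_GOPS_PER_W = 192.0         # "192 GOPS/W"


def ops_per_core_per_cycle(gops: float, cores: int, clock_hz: float) -> float:
    """Average operations each core retires per clock cycle at the given throughput."""
    return gops * 1e9 / (cores * clock_hz)


def implied_power_w(gops: float, gops_per_w: float) -> float:
    """Power implied if throughput and efficiency were measured together (assumption)."""
    return gops / gops_per_w


def gate_delay_ps(clock_hz: float, gates_per_cycle: int) -> float:
    """Average delay per gate, in picoseconds, for the quoted cycle depth."""
    cycle_ps = 1e12 / clock_hz
    return cycle_ps / gates_per_cycle


if __name__ == "__main__":
    print(f"ops/core/cycle: {ops_per_core_per_cycle(PEAK_GOPS, NUM_CORES, CLOCK_HZ):.2f}")
    print(f"implied power:  {implied_power_w(PEAK_GOPS, PEAK_GOPS_PER_W):.2f} W")
    print(f"gate delay:     {gate_delay_ps(CLOCK_HZ, GATE_DELAYS_PER_CYCLE):.1f} ps")
```

Under these assumptions, the peak corresponds to roughly 1.5 operations per core per cycle, an implied power of about 1.2 W, and an average gate delay of about 28 ps. An average above one operation per cycle would be consistent with instructions that count as more than one operation, but the abstract does not define how operations are counted.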