Shared-L1-memory clusters of streamlined instruction processors (processing elements, PEs) are common building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). Scaling out these architectures by increasing the number of clusters incurs computational and power overhead, because large data structures must be split into chunks, merged again, and moved across memory hierarchies via the high-latency global interconnect. Scaling up the cluster instead reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected PE-to-L1-memory crossbar grows quadratically with PE count, posing a major physical implementation challenge. We present TeraPool, a physically implementable scaled-up cluster design with more than 1000 floating-point-capable RISC-V PEs sharing a multi-megabyte L1 memory of more than 4000 banks via a low-latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves near-gigahertz frequencies (910 MHz) under typical conditions (0.80 V, 25 °C). The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ per memory bank access, just 0.74-1.1x the cost of an FP32 FMA. A high-bandwidth main memory link manages data transfers into and out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single-precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high average IPC/PE of 0.8) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in the literature.
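To make two of the quantitative claims above concrete, the following is a minimal Python sketch, not part of the TeraPool design flow. It counts the crosspoints of a flat, fully connected PE-to-bank crossbar and back-computes the FP32 FMA energy implied by the stated access-energy ratios. The exact figures of 1024 PEs and 4 banks per PE are assumptions chosen to match the "more than 1000 PEs" and "more than 4000 banks" in the text, and the function names are hypothetical.

```python
def crossbar_crosspoints(num_pes: int, banks_per_pe: int = 4) -> int:
    """Crosspoints in a flat crossbar connecting every PE to every bank.

    Assumes the bank count scales with the PE count (banks_per_pe is an
    assumed banking factor), so the result grows quadratically in num_pes.
    """
    num_banks = num_pes * banks_per_pe
    return num_pes * num_banks


def implied_fma_energy_pj(access_energy_pj: float, ratio_to_fma: float) -> float:
    """FMA energy implied by an access energy stated as a multiple of it."""
    return access_energy_pj / ratio_to_fma


if __name__ == "__main__":
    # Quadrupling the PE count multiplies the crosspoint count by 16.
    small = crossbar_crosspoints(256)
    large = crossbar_crosspoints(1024)
    print(f"256 PEs:  {small} crosspoints")
    print(f"1024 PEs: {large} crosspoints ({large // small}x)")

    # Both ends of the stated range should imply a similar FMA energy.
    low = implied_fma_energy_pj(9.0, 0.74)
    high = implied_fma_energy_pj(13.5, 1.1)
    print(f"Implied FP32 FMA energy: {low:.2f} to {high:.2f} pJ")
```

Under these assumptions, both ends of the stated 9-13.5 pJ range point to an FMA cost of roughly 12 pJ, and the 16x growth in crosspoints for a 4x growth in PE count illustrates why a flat crossbar becomes hard to implement physically at this scale.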