Radio Access Network (RAN) workloads are rapidly scaling up in data-processing intensity and throughput as the 5G (and beyond) standards grow in the number of antennas and sub-carriers. Many-core clusters offer flexible Processing Elements (PEs), efficient memory access, and a productive parallel programming model, which makes them a well-matched architecture for next-generation software-defined RANs; however, the staggering performance requirements demand a large number of PEs coupled with extreme Power, Performance, and Area (PPA) efficiency. We present the architecture, design, and full physical implementation of TeraPool-SDR, a cluster for Software-Defined Radio (SDR) with 1024 latency-tolerant, compact RV32 PEs that share a global view of a 4 MiB, 4096-banked L1 memory. We report several feasible configurations of TeraPool-SDR featuring an ultra-high-bandwidth PE-to-L1-memory interconnect, clocked at 730 MHz, 880 MHz, and 924 MHz (TT/0.80 V/25 °C) in 12 nm FinFET technology. The TeraPool-SDR cluster achieves high energy efficiency on all key SDR kernels for 5G RANs: Fast Fourier Transform (93 GOPS/W), Matrix Multiplication (125 GOPS/W), Channel Estimation (96 GOPS/W), and Linear System Inversion (61 GOPS/W). On every kernel it consumes less than 10 W, in compliance with industry standards.
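
The headline figures above imply a few derived quantities that are easy to check by hand. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and the throughput ceiling assumes that each quoted GOPS/W figure and the "less than 10 W" bound refer to the same operating point, which the abstract does not state explicitly.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# All names below are illustrative; only the constants come from the text.

MIB = 1024 * 1024

NUM_PES = 1024          # RV32 processing elements
NUM_BANKS = 4096        # L1 memory banks
L1_BYTES = 4 * MIB      # shared L1 capacity
POWER_CAP_W = 10.0      # stated upper bound on power for all kernels

# Energy efficiency per kernel, in GOPS/W, as quoted.
EFFICIENCY_GOPS_PER_W = {
    "FFT": 93,
    "MatMul": 125,
    "ChanEst": 96,
    "LinSysInv": 61,
}


def l1_geometry():
    """Derive per-bank and per-PE L1 figures from the quoted totals."""
    return {
        "bytes_per_bank": L1_BYTES // NUM_BANKS,
        "banks_per_pe": NUM_BANKS // NUM_PES,
        "bytes_per_pe": L1_BYTES // NUM_PES,
    }


def throughput_ceiling_gops(efficiency_gops_per_w, power_w=POWER_CAP_W):
    """Upper bound on throughput: efficiency times the power cap.

    Assumption: efficiency and power are measured at the same operating
    point. Since actual power is below the cap, actual throughput is
    below this value.
    """
    return efficiency_gops_per_w * power_w


if __name__ == "__main__":
    print(l1_geometry())
    for kernel, eff in EFFICIENCY_GOPS_PER_W.items():
        print(f"{kernel}: < {throughput_ceiling_gops(eff):.0f} GOPS")
```

Under these assumptions, each bank holds 1 KiB, there are four banks per PE, and each PE's share of L1 is 4 KiB; the per-kernel values printed are upper bounds only, not reported throughput.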