ML and HPC applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing CPUs and GPUs struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy's compute chiplets in 12 nm FinFET and its passive interposer, Hedwig, in a 65 nm node. On dense linear algebra (LA), Occamy achieves a competitive FPU utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm², leading state-of-the-art (SoA) processors by 1.7x and 1.2x, respectively. On sparse-dense LA, it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm², surpassing the SoA by 5.2x and 11x, respectively. On sparse-sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm². Finally, we reach up to 75% and 54% FPU utilization on dense (LLM) and graph-sparse (GCN) ML inference workloads, respectively. Occamy's RTL is freely available under a permissive open-source license.