Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerating parallel workloads such as Deep Learning (DL): it achieves massive data parallelism at low area overhead and, by moving computational resources closer to the data, saves orders of magnitude in data movement. While many PIM architectures have been proposed, improvements are still needed in communicating intermediate results to consumer kernels, in communication between tiles at scale, in reduction operations, and in efficiently performing bit-serial operations with constants. We present PIMSAB, a scalable architecture that provides a spatially aware communication network for efficient intra-tile and inter-tile data movement, together with efficient support for compute patterns that are generally inefficient in bit-serial form. The architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs) and is co-designed with a compiler to achieve high utilization. Its key novelties are: (1) efficient support for spatially aware communication through a local H-tree network for reductions, explicit hardware for shuffling operands, and systolic broadcasting; and (2) exploiting the divisible nature of bit-serial computation through adaptive precision, bit-slicing, and efficient handling of constant operations. Across common DL kernels and an end-to-end DL network (ResNet18), PIMSAB outperforms a similarly provisioned modern Tensor Core GPU (NVIDIA A100) by 3x and reduces energy by 4.2x. Compared with similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM), PIMSAB achieves speedups of 3.7x and 3.88x, respectively.
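To make the "divisible nature of bit-serial computations" concrete, the following is a minimal software sketch of bit-serial addition, not PIMSAB's actual hardware or compiler interface. It assumes a transposed layout in which bit i of every operand sits in the same row (a "bit-plane"), so one step processes one bit position across all lanes at once. The helper names (`to_bitplanes`, `from_bitplanes`, `bit_serial_add`) are hypothetical and introduced only for illustration.

```python
def to_bitplanes(values, bits):
    """Transpose integers so that planes[i][lane] is bit i of values[lane]."""
    return [[(v >> i) & 1 for v in values] for i in range(bits)]


def from_bitplanes(planes):
    """Inverse of to_bitplanes: rebuild one integer per lane."""
    lanes = len(planes[0])
    return [
        sum(planes[i][lane] << i for i in range(len(planes)))
        for lane in range(lanes)
    ]


def bit_serial_add(a_planes, b_planes):
    """Lane-wise addition, one bit position per step (assumed model).

    Returns the sum bit-planes (one extra plane for the final carry)
    and the number of steps, which equals the operand precision.
    """
    lanes = len(a_planes[0])
    carry = [0] * lanes
    out = []
    for a_row, b_row in zip(a_planes, b_planes):
        # Full adder applied to every lane in parallel.
        out.append([a ^ b ^ c for a, b, c in zip(a_row, b_row, carry)])
        carry = [(a & b) | (c & (a ^ b)) for a, b, c in zip(a_row, b_row, carry)]
    out.append(carry)
    return out, len(a_planes)


# 8-bit operands take 8 steps; the same data at 4-bit precision takes 4.
sum8, steps8 = bit_serial_add(to_bitplanes([3, 5, 250], 8), to_bitplanes([4, 7, 10], 8))
sum4, steps4 = bit_serial_add(to_bitplanes([3, 5, 9], 4), to_bitplanes([4, 7, 6], 4))
result8 = from_bitplanes(sum8)
result4 = from_bitplanes(sum4)
```

In this model the step count scales with the number of bit-planes rather than with the number of lanes, which is why lowering operand precision directly shortens the computation. This is the property that adaptive precision and bit-slicing rely on.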