CUDA-PIM: End-to-End Integration of Digital Processing-in-Memory from High-Level C++ to Microarchitectural Design

Digital processing-in-memory (PIM) architectures mitigate the memory wall problem by performing parallel bitwise operations directly within memory. Recent works have demonstrated their algorithmic potential for accelerating data-intensive applications; however, a significant gap remains in the programming model and microarchitectural design. The gap is widened by the emerging model of partitions, which significantly complicates the control and peripheral circuitry. Inspired by NVIDIA CUDA, this paper therefore provides an end-to-end architectural integration of digital memristive PIM, from an abstract high-level C++ programming interface for vector operations down to the low-level microarchitecture. We begin by proposing an efficient microarchitecture and instruction set architecture (ISA) that bridge the gap between the low-level control periphery and an abstraction of PIM parallelism into warps and threads. We then propose a PIM compilation library that converts high-level C++ into ISA instructions, and a PIM driver that translates ISA instructions into PIM micro-operations. Together, these simplify the development of PIM applications and allow PIM to be integrated into larger existing C++ CPU/GPU programs for heterogeneous computing with little effort. Lastly, we present an efficient GPU-accelerated simulator for the proposed PIM microarchitecture. Although slower than a theoretical PIM chip, the simulator gives developers an accessible platform on which to start executing and debugging PIM algorithms. To validate our approach, we implement state-of-the-art PIM-based matrix-operation and FFT algorithms as case studies. These examples demonstrate substantially simplified development without compromising performance, showing the potential and significance of CUDA-PIM.
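To make the layering concrete, the following is a minimal sketch of how a vector addition might descend from a C++ call to an ISA-level instruction and then to micro-operations on a simulated crossbar. It is not the CUDA-PIM API: every name (`Crossbar`, `AddInstr`, `MicroOp`, `lower`, `pim_add`), the register layout, and the 8-bit word width are hypothetical. It also assumes that a column-wise NOR applied to all rows at once is the only logic primitive, which is a common choice in memristive stateful logic but is not stated in the abstract. Each row plays the role of one thread.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout, for illustration only (not the CUDA-PIM design).
constexpr int kBits = 8;              // word width per thread
constexpr int kRegs = 4;              // registers per row
constexpr int kRows = 64;             // one row per thread
constexpr int kZero = kRegs * kBits;  // column assumed to be held at logical 0
constexpr int kCarry = kZero + 1;     // carry column
constexpr int kTmp = kZero + 2;       // first of 7 temporary columns
constexpr int kCols = kTmp + 7;

// Micro-operation: column out = NOR(column a, column b), in every row at once.
struct MicroOp { int a, b, out; };

// ISA-level instruction: register dst = register a + register b (mod 2^kBits).
struct AddInstr { int dst, a, b; };

// Simulated array: bit r of cols[c] is the cell at row r, column c.
struct Crossbar {
    std::array<std::uint64_t, kCols> cols{};

    void run(const MicroOp& m) { cols[m.out] = ~(cols[m.a] | cols[m.b]); }

    void write(int row, int reg, std::uint8_t v) {
        const std::uint64_t bit = std::uint64_t{1} << row;
        for (int i = 0; i < kBits; ++i) {
            const int c = reg * kBits + i;
            cols[c] = ((v >> i) & 1) ? (cols[c] | bit) : (cols[c] & ~bit);
        }
    }

    std::uint8_t read(int row, int reg) const {
        std::uint8_t v = 0;
        for (int i = 0; i < kBits; ++i)
            v |= static_cast<std::uint8_t>(((cols[reg * kBits + i] >> row) & 1) << i);
        return v;
    }
};

// "Driver" layer: lower one ADD into a bit-serial ripple-carry adder,
// using a 9-gate NOR full adder per bit position.
std::vector<MicroOp> lower(const AddInstr& in) {
    std::vector<MicroOp> ops;
    const int t = kTmp;
    for (int i = 0; i < kBits; ++i) {
        const int A = in.a * kBits + i;
        const int B = in.b * kBits + i;
        const int S = in.dst * kBits + i;
        const int Cin = (i == 0) ? kZero : kCarry;
        ops.push_back({A, B, t + 0});          // n1 = NOR(A, B)
        ops.push_back({A, t + 0, t + 1});      // n2 = ~A & B
        ops.push_back({B, t + 0, t + 2});      // n3 = A & ~B
        ops.push_back({t + 1, t + 2, t + 3});  // n4 = XNOR(A, B)
        ops.push_back({t + 3, Cin, t + 4});    // n5 = (A ^ B) & ~Cin
        ops.push_back({t + 3, t + 4, t + 5});  // n6 = (A ^ B) & Cin
        ops.push_back({Cin, t + 4, t + 6});    // n7 = ~(A ^ B) & ~Cin
        ops.push_back({t + 5, t + 6, S});      // S = A ^ B ^ Cin
        ops.push_back({t + 0, t + 4, kCarry}); // Cout = majority(A, B, Cin)
    }
    return ops;
}

// "Library" layer: issue one ADD and return the number of micro-ops executed.
int pim_add(Crossbar& xb, int dst, int a, int b) {
    assert(dst >= 0 && dst < kRegs && a >= 0 && a < kRegs && b >= 0 && b < kRegs);
    const std::vector<MicroOp> ops = lower({dst, a, b});
    for (const MicroOp& m : ops) xb.run(m);
    return static_cast<int>(ops.size());
}
```

In this sketch, a single call such as `pim_add(xb, 2, 0, 1)` adds registers 0 and 1 in all 64 rows simultaneously. The micro-operation count depends only on the word width (9 NORs per bit here) and not on the number of rows, which is the kind of row-level parallelism that a warp-and-thread abstraction exposes to the programmer.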
