Digital processing-in-memory (PIM) architectures mitigate the memory wall problem by performing parallel bitwise operations directly within the memory. Recent works have demonstrated their algorithmic potential for accelerating data-intensive applications; however, there remains a significant gap in the programming model and microarchitectural design. This gap is further exacerbated by aspects unique to memristive PIM, such as partitions and operations across both directions of the memory array. To address this gap, this paper provides an end-to-end architectural integration of digital memristive PIM, from a high-level Python library for tensor operations (similar to NumPy and PyTorch) down to the low-level microarchitectural design. We begin by proposing an efficient microarchitecture and instruction set architecture (ISA) that bridge the gap between the low-level control periphery and an abstraction of PIM parallelism. We then propose a PIM development library that converts high-level Python to ISA instructions and a PIM driver that translates ISA instructions into PIM micro-operations. We evaluate the resulting framework, PyPIM, via a cycle-accurate simulator on a wide variety of benchmarks that demonstrate both the versatility of the Python library and its performance relative to theoretical PIM bounds. Overall, PyPIM drastically simplifies the development of PIM applications and enables the conversion of existing tensor-oriented Python programs to PIM with ease.
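To make the layering described above concrete, the following is a minimal toy sketch of the three levels: a tensor-style Python object, an ISA-level instruction, and a driver that lowers it into micro-operations. It is not the PyPIM API. Every name here (`BitTensor`, `Instr`, `MicroOp`, `ToyDriver`, `ToyArray`) is hypothetical, and the choice of a row-parallel NOR gate as the only micro-operation is an assumption made for illustration, not something stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical ISA instruction: element-wise op on whole columns."""
    op: str
    dst: int
    a: int
    b: int


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical micro-operation: one NOR gate applied to all rows at once."""
    dst: int
    a: int
    b: int


class ToyArray:
    """Bit array stored column-major: one column per value, one row per element."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = [[0] * rows for _ in range(cols)]
        self.cycles = 0

    def nor(self, u):
        a, b = self.cols[u.a], self.cols[u.b]
        self.cols[u.dst] = [1 - (x | y) for x, y in zip(a, b)]
        self.cycles += 1  # assumed cost: one cycle per micro-op, independent of row count


class ToyDriver:
    """Lowers ISA instructions to NOR micro-ops and hands out columns."""

    def __init__(self, array):
        self.array = array
        self.scratch = [0, 1, 2, 3]  # columns reserved for intermediate values
        self.next_free = 4

    def alloc(self):
        col = self.next_free
        self.next_free += 1
        return col

    def lower(self, ins):
        if ins.op != "XOR":
            raise NotImplementedError(ins.op)
        t0, t1, t2, t3 = self.scratch
        return [
            MicroOp(t0, ins.a, ins.b),  # NOR(a, b)
            MicroOp(t1, ins.a, t0),
            MicroOp(t2, ins.b, t0),
            MicroOp(t3, t1, t2),        # XNOR(a, b)
            MicroOp(ins.dst, t3, t3),   # NOT(XNOR) = XOR(a, b)
        ]

    def execute(self, ins):
        for u in self.lower(ins):
            self.array.nor(u)


class BitTensor:
    """Library-level handle: a 1-bit-per-element tensor living in one column."""

    def __init__(self, driver, bits=None):
        self.driver = driver
        self.col = driver.alloc()
        if bits is not None:
            driver.array.cols[self.col] = list(bits)

    def __xor__(self, other):
        out = BitTensor(self.driver)
        self.driver.execute(Instr("XOR", out.col, self.col, other.col))
        return out

    def to_list(self):
        return list(self.driver.array.cols[self.col])


def demo():
    driver = ToyDriver(ToyArray(rows=4, cols=8))
    x = BitTensor(driver, [0, 0, 1, 1])
    y = BitTensor(driver, [0, 1, 0, 1])
    z = x ^ y  # Python operator -> one ISA instruction -> five NOR micro-ops
    return z.to_list(), driver.array.cycles
```

In this sketch, `demo()` issues a single `XOR` instruction that the driver expands into five NOR micro-operations, each acting on all four rows at once. The cycle counter grows with the number of micro-operations rather than with the number of tensor elements, which is the kind of parallelism an ISA abstraction would need to expose.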