Processing-in-memory (PIM) architectures are emerging to reduce data movement in data-intensive applications. These architectures seek to exploit the same physical devices for both information storage and logic, thereby drastically reducing the required data transfer and utilizing the full internal memory bandwidth. Whereas analog PIM utilizes the inherent connectivity of crossbar arrays for approximate matrix-vector multiplication in the analog domain, digital PIM architectures enable bitwise logic operations with massive parallelism across columns of data within memory arrays. Several recent works have extended the computational capabilities of digital PIM architectures towards full-precision (single-precision floating-point) acceleration of convolutional neural networks (CNNs); yet, they lack a comprehensive comparison to GPUs. In this paper, we examine the potential of digital PIM for CNN acceleration through an updated quantitative comparison with GPUs, supplemented with an analysis of the overall limitations of digital PIM. We begin by investigating the different PIM architectures from a theoretical perspective to understand their underlying performance limitations and their improvements over state-of-the-art hardware. We then uncover the tradeoffs between the different strategies through a series of benchmarks ranging from memory-bound vector arithmetic to CNN acceleration. We conclude with insights into the general performance of digital PIM architectures for different data-intensive applications.