Data movement between the processor and main memory is a first-order obstacle to improving performance and energy efficiency in modern systems. To address this obstacle, Processing-using-Memory (PuM) is a promising approach in which bulk-bitwise operations are performed by leveraging intrinsic analog properties within the DRAM array and massive parallelism across DRAM columns. Unfortunately, 1) modern off-the-shelf DRAM chips do not officially support PuM operations, and 2) existing techniques for performing PuM operations on off-the-shelf DRAM chips suffer from two key limitations. First, these techniques have low success rates, i.e., only a small fraction of DRAM columns correctly execute PuM operations. This is because the techniques operate beyond manufacturer-recommended timing constraints, which makes these operations highly susceptible to noise and process variation. Second, these techniques support only a limited set of compute primitives, which prevents them from fully leveraging parallelism across DRAM columns and thus limits their performance benefits. We propose PULSAR, a new technique that enables high-success-rate and high-performance PuM operations in off-the-shelf DRAM chips. PULSAR leverages our new observation that a carefully crafted sequence of DRAM commands simultaneously activates up to 32 DRAM rows. PULSAR overcomes the limitations of existing techniques by 1) replicating the input data to improve the success rate and 2) enabling new bulk-bitwise operations (e.g., many-input majority, Multi-RowInit, and Bulk-Write) to improve performance. Our analysis of 120 off-the-shelf DDR4 chips from two major manufacturers shows that PULSAR achieves a 24.18% higher success rate and 121% higher performance across seven arithmetic-logic operations compared to FracDRAM, a state-of-the-art PuM technique for off-the-shelf DRAM.
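To make the idea of a many-input majority over replicated inputs concrete, here is a minimal, idealized Python sketch. It assumes each DRAM row is modeled as an integer bitmask with one bit per column, and that simultaneously activating N rows makes each column resolve to the majority value of its N cells. It deliberately ignores charge sharing, noise, process variation, and timing, which are exactly what make real success rates imperfect. The names `bitwise_majority` and `replicate` are hypothetical and are not PULSAR's interface. The AND/OR construction (majority with an all-zeros or all-ones row) is a well-known majority-based technique used here only for illustration, not necessarily the exact command sequence PULSAR issues.

```python
from typing import List, Sequence


def bitwise_majority(rows: Sequence[int], width: int) -> int:
    """Idealized many-input majority across simultaneously activated rows.

    Column c of the result is 1 iff more than half of the rows hold a 1 in
    column c. Ties are rejected because, in this model, a real bitline would
    resolve them unpredictably (an assumption, not a measured behavior).
    """
    n = len(rows)
    result = 0
    for col in range(width):
        ones = sum((row >> col) & 1 for row in rows)
        if 2 * ones == n:
            raise ValueError(f"tie in column {col}")
        if 2 * ones > n:
            result |= 1 << col
    return result


def replicate(inputs: Sequence[int], copies: int) -> List[int]:
    """Hypothetical helper: store each input operand in `copies` rows,
    mimicking input replication across the activated rows."""
    return [row for row in inputs for _ in range(copies)]


WIDTH = 8                      # number of modeled DRAM columns
A, B = 0b1100_1010, 0b1010_0110
ZERO, ONE = 0, (1 << WIDTH) - 1

# MAJ(A, B, 0) == A AND B and MAJ(A, B, 1) == A OR B.
# Replicating each of the 3 operands 10 times uses 30 of up to 32 rows;
# with an odd number of distinct operands, no column can tie.
and_rows = replicate([A, B, ZERO], copies=10)
or_rows = replicate([A, B, ONE], copies=10)
and_result = bitwise_majority(and_rows, WIDTH)
or_result = bitwise_majority(or_rows, WIDTH)
print(f"AND: {and_result:08b}  OR: {or_result:08b}")
```

In this idealized model, replication does not change the logical result. Its benefit in real hardware comes from the larger analog margin when many cells share the same value, which this sketch does not simulate.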