Data movement in memory-intensive workloads, such as deep learning, incurs energy costs that are over three orders of magnitude higher than the cost of computation. Since these workloads involve frequent data transfers between memory and processing units, addressing data movement overheads is crucial for improving performance. Processing-using-memory (PuM) offers an effective solution by enabling in-memory computation, thereby minimizing data transfers. In this paper, we propose Lama, a LUT-based PuM architecture designed to efficiently execute SIMD operations by supporting independent column accesses within each mat of a DRAM subarray. Lama exploits DRAM's mat-level parallelism and open-page policy to significantly reduce the number of energy-intensive memory activation (ACT) commands, which are the primary source of overhead in most PuM architectures. Unlike prior PuM solutions, Lama supports up to 8-bit operand precision without decomposing computations, while incurring only a 2.47% area overhead. Our evaluation shows that Lama achieves an average performance improvement of 8.5x over state-of-the-art PuM architectures and 3.8x over a CPU, along with energy efficiency gains of 6.9x and 8x, respectively, for bulk 8-bit multiplication. We also introduce LamaAccel, an HBM-based PuM accelerator that uses Lama to accelerate the inference of attention-based models. LamaAccel employs exponential quantization to simplify the product and accumulation steps of dot-product operations, transforming them into addition and counting. LamaAccel delivers up to 9.3x/19.2x reduction in energy and 4.8x/9.8x speedup over a TPU/GPU, along with up to 5.8x energy reduction and 2.1x speedup over a state-of-the-art PuM baseline.
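To illustrate the idea behind exponential quantization, the following is a minimal Python sketch (not the paper's exact scheme; the helper names and the quantization details are assumptions for illustration). When both operands are quantized to signed powers of two, each product reduces to an exponent addition, and accumulation reduces to counting signed occurrences of each result exponent, followed by one small weighted sum:

```python
import numpy as np
from collections import Counter

def exp_quantize(x):
    # Map each value to (sign, integer exponent) of the nearest power of two.
    # Hypothetical helper for illustration only.
    sign = np.where(x >= 0, 1, -1)
    exp = np.round(np.log2(np.maximum(np.abs(x), 1e-12))).astype(int)
    return sign, exp

def dot_by_count(sa, ea, sb, eb):
    # Each product is sign * 2^(ea + eb): multiplication becomes exponent
    # addition. Accumulation counts signed hits per result exponent, so the
    # bulk of the dot product is addition and counting, not multiplication.
    counts = Counter()
    for s1, e1, s2, e2 in zip(sa, ea, sb, eb):
        counts[e1 + e2] += s1 * s2
    return sum(c * 2.0 ** e for e, c in counts.items())

# Example: values already powers of two, so quantization is exact here.
sa, ea = exp_quantize(np.array([1.0, 2.0, -4.0]))
sb, eb = exp_quantize(np.array([2.0, 0.5, 1.0]))
result = dot_by_count(sa, ea, sb, eb)  # matches 1*2 + 2*0.5 + (-4)*1
```

In hardware, the exponent additions and per-bucket counters are far cheaper than full-precision multipliers, which is what makes this mapping attractive for a PuM substrate.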