DARTH-PUM: A Hybrid Processing-Using-Memory Architecture

Analog processing-using-memory (PUM; a.k.a. in-memory computing) makes use of electrical interactions inside memory arrays to perform bulk matrix-vector multiplication (MVM) operations. However, many popular matrix-based kernels need to execute non-MVM operations, which analog PUM cannot directly perform. To retain their energy efficiency, analog PUM architectures augment memory arrays with CMOS-based domain-specific fixed-function hardware to provide complete kernel functionality, but the difficulty of integrating such specialized CMOS logic with memory arrays has largely limited analog PUM to being an accelerator for machine learning inference, or for closely related kernels. An opportunity exists to harness analog PUM for general-purpose computation: recent works have shown that memory arrays can also perform Boolean PUM operations, albeit with very different supporting hardware and electrical signals than analog PUM. We propose DARTH-PUM, a general-purpose hybrid PUM architecture that tackles key hardware and software challenges in integrating analog PUM and digital PUM. We propose optimized peripheral circuitry, coordinating hardware to manage and interface between both types of PUM, an easy-to-use programming interface, and low-cost support for flexible data widths. These design elements allow us to build a practical PUM architecture that can execute kernels fully in memory, and can scale easily to cater to domains ranging from embedded applications to large-scale data-driven computing. We show how three popular applications (AES encryption, convolutional neural networks, large language models) can map to and benefit from DARTH-PUM, with speedups of 59.4x, 14.8x, and 40.8x, respectively, over an analog+CPU baseline.
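
The division of labor described above (MVM in analog arrays, non-MVM steps as Boolean operations in digital arrays) can be illustrated with a toy functional model. This is only a minimal sketch under stated assumptions: NOR is assumed as the digital PUM primitive, XOR with a constant is chosen arbitrarily as the non-MVM step, and every function name is hypothetical. It does not model DARTH-PUM's circuitry or its programming interface.

```python
# Toy functional model of a hybrid PUM kernel.
# Assumptions (not from the paper): NOR is the digital primitive,
# and the non-MVM step is an XOR with per-element constants.

def analog_mvm(matrix, vector):
    """Stand-in for an analog array: all row dot products in one bulk step."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def nor(a, b, width):
    """Bitwise NOR on width-bit words (assumed digital PUM primitive)."""
    return ~(a | b) & ((1 << width) - 1)

def xor_from_nor(a, b, width):
    """XOR composed only of NOR operations (five NORs in total)."""
    n1 = nor(a, b, width)
    n2 = nor(a, n1, width)
    n3 = nor(b, n1, width)
    xnor = nor(n2, n3, width)      # four NORs give XNOR
    return nor(xnor, xnor, width)  # NOR(x, x) inverts, giving XOR

def hybrid_kernel(matrix, vector, constants, width=8):
    """Analog MVM followed by a digital, NOR-only elementwise XOR."""
    mask = (1 << width) - 1
    products = analog_mvm(matrix, vector)
    # Truncate each product to the working data width before the digital step.
    return [xor_from_nor(p & mask, c, width) for p, c in zip(products, constants)]

result = hybrid_kernel([[1, 2], [3, 4]], [5, 6], [0x0F, 0xF0])
print(result)
```

In this model, the whole kernel stays inside the two "array" functions, which mirrors the stated goal of executing kernels fully in memory rather than handing the non-MVM step to a CPU.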
