Analog processing-using-memory (PUM; a.k.a. in-memory computing) uses electrical interactions inside memory arrays to perform bulk matrix-vector multiplication (MVM) operations. However, many popular matrix-based kernels also need non-MVM operations, which analog PUM cannot perform directly. To provide complete kernel functionality while retaining energy efficiency, analog PUM architectures augment memory arrays with CMOS-based, domain-specific, fixed-function hardware. The difficulty of integrating such specialized CMOS logic with memory arrays has largely limited analog PUM to accelerating machine learning inference or closely related kernels. An opportunity exists to harness analog PUM for general-purpose computation: recent works have shown that memory arrays can also perform Boolean PUM operations, albeit with very different supporting hardware and electrical signals than analog PUM. We propose DARTH-PUM, a general-purpose hybrid PUM architecture that tackles key hardware and software challenges in integrating analog PUM and digital PUM. We design optimized peripheral circuitry, coordinating hardware to manage and interface between both types of PUM, an easy-to-use programming interface, and low-cost support for flexible data widths. These design elements allow us to build a practical PUM architecture that can execute kernels fully in memory and can scale easily to cater to domains ranging from embedded applications to large-scale data-driven computing. We show how three popular applications (AES encryption, convolutional neural networks, and large language models) can map to and benefit from DARTH-PUM, with speedups of 59.4x, 14.8x, and 40.8x, respectively, over an analog+CPU baseline.