High Bandwidth Memory with Processing-in-Memory (HBM-PIM) can reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require specialized software stacks. In this work, we investigate whether HBM-PIM can serve as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as a semantic reference. We propose a PEP-based execution model that maps AME element-wise and matrix instructions to HBM-PIM micro-kernels and data instructions in memory operations. Unlike state-of-the-art HBM-PIM approaches, we introduce a reduction-free outer-product dataflow that performs accumulation entirely within memory despite the lack of native reduction support. Our approach supports end-to-end execution of element-wise operations, GEMV, and GEMM in PIM mode, minimizing host involvement and off-chip transfers. An experimental evaluation on Samsung Aquabolt-XL shows that AME matrix tile multiplication achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single HBM pseudo-channel.
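To illustrate what a reduction-free outer-product dataflow means in general, the following is a minimal host-side sketch in plain Python. It is not the paper's HBM-PIM micro-kernel: the function names `gemm_outer_product` and `gemm_inner_product`, and the picture of each output column as one SIMD lane, are assumptions made purely for illustration. The point is that the outer-product form only ever needs a scalar broadcast followed by lane-wise multiply-accumulate, whereas the inner-product form needs a horizontal sum across lanes.

```python
def gemm_outer_product(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates.

    Every update is a lane-wise multiply-accumulate: lane j only touches
    C[i][j], so no cross-lane (horizontal) reduction is ever required.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    assert len(B) == k, "inner dimensions must match"
    C = [[0.0] * n for _ in range(m)]
    for p in range(k):               # one rank-1 update per step of k
        b_row = B[p]                 # row p of B, one element per lane
        for i in range(m):
            a_ip = A[i][p]           # scalar broadcast to all lanes
            acc = C[i]               # accumulator row stays in place
            for j in range(n):       # independent lanes (hypothetical SIMD width n)
                acc[j] += a_ip * b_row[j]
    return C


def gemm_inner_product(A, B):
    """Reference inner-product form, shown only for contrast.

    Each output element is a dot product, i.e. a horizontal sum over k,
    which is the reduction the outer-product form avoids.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    assert len(B) == k, "inner dimensions must match"
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(gemm_outer_product(A, B))
    print(gemm_inner_product(A, B))
```

Both functions compute the same product; they differ only in loop order and in where the accumulation happens. GEMV falls out as the special case where `A` has a single row.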