Near-bank Processing-in-Memory (PIM) architectures integrate processing cores (PIMcores) close to DRAM banks to mitigate the high cost of off-chip memory accesses. When accelerating convolutional neural networks (CNNs) on DRAM-PIM, performance is often constrained by cross-bank (or cross-PIMcore) data transfers, which are induced by the conventional layer-by-layer dataflow that enforces inter-bank (or inter-PIMcore) dependencies across successive CNN layers. To address this challenge, we propose PIMfused, a hardware-software co-design that enables fused-layer dataflow for end-to-end CNN execution in near-bank DRAM-PIM. By adopting fused-layer dataflow, PIMfused improves data reuse and, more importantly, breaks inter-bank data dependencies, thereby reducing cross-bank data transfers without sacrificing bank-level parallelism. We study the impact of buffer sizes and PIMcore parallelism (1-bank vs. 4-bank) on PIMfused using end-to-end ResNet18. We present three key takeaways and show that with 4-bank PIMcores, PIMfused achieves overall PPA gains over a GDDR6-AiM-like baseline, cutting memory cycles to 30.6%, energy to 83.4%, and area to 76.5% of the baseline.