A complete discussion of a fully reconfigurable, digital, scalable, graph- and sparsity-aware near-memory accelerator for graph neural networks

Graph neural networks (GNNs) have gained significant interest for applications such as citation network analysis and drug discovery because they can apply machine learning techniques to graph-structured data. GNNs typically employ a two-stage execution pipeline consisting of combination and aggregation kernels. The combination stage performs data-intensive convolution operations with relatively regular memory access patterns, whereas the aggregation stage operates on sparse graph data with highly irregular accesses. These heterogeneous memory behaviors make conventional CPU- and GPU-based execution energy-inefficient because of substantial data movement overheads. Existing accelerators attempt to mitigate these challenges using specialized architectures and processing-in-memory (PIM) techniques. However, prior approaches often suffer from limited scalability, area overheads, restricted parallelism, and the energy inefficiencies associated with analog compute and dedicated accelerator structures. This paper presents NEM-GNN, a scalable, DAC/ADC-less processing-in-memory architecture for GNN acceleration. The proposed design introduces early compute termination mechanisms, pre-computation using reconfigurable system-on-chip components, and graph- and sparsity-aware near-memory aggregation using a compute-as-soon-as-ready (CAR) and broadcast-based execution model. Experimental results demonstrate that NEM-GNN achieves approximately 80--230x higher performance, 80--300x higher throughput, 850--1134x better energy efficiency, and 7--8x higher compute density compared to prior state-of-the-art approaches.
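
The contrast between the two stages can be made concrete with a minimal software sketch. The code below is not the NEM-GNN dataflow; it is a plain-Python illustration that assumes a sum aggregator, a CSR adjacency layout, and combination running before aggregation. The function names (`combine`, `aggregate`) and the 3-node graph are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a plain-Python GNN layer split into the two
# stages named in the abstract. This is NOT the NEM-GNN hardware dataflow;
# the sum aggregator, CSR layout, and stage order are assumptions.

def combine(features, weights):
    """Combination: dense product X @ W, read row by row (regular access)."""
    n_out = len(weights[0])
    return [
        [sum(x[k] * weights[k][j] for k in range(len(x))) for j in range(n_out)]
        for x in features
    ]

def aggregate(row_ptr, col_idx, combined):
    """Aggregation: sum neighbor rows using a CSR adjacency (irregular access)."""
    n_out = len(combined[0])
    out = []
    for v in range(len(row_ptr) - 1):
        acc = [0.0] * n_out
        # Neighbor ids come from col_idx, so which rows of `combined` are
        # read depends on the graph structure rather than on a fixed stride.
        for u in col_idx[row_ptr[v]:row_ptr[v + 1]]:
            for j in range(n_out):
                acc[j] += combined[u][j]
        out.append(acc)
    return out

# Hypothetical 3-node graph: node 0 has neighbors {1, 2}, node 1 has {0},
# node 2 has none.
row_ptr = [0, 2, 3, 3]
col_idx = [1, 2, 0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 2.0], [3.0, 4.0]]

combined = combine(features, weights)
layer_out = aggregate(row_ptr, col_idx, combined)
```

In this sketch, `combine` touches every feature row in order, while `aggregate` skips node 2 entirely because it has no neighbors. That kind of structure-dependent, sparse work is what a graph- and sparsity-aware design would aim to exploit.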
