A complete discussion of a fully reconfigurable, digital, scalable, graph- and sparsity-aware near-memory accelerator for graph neural networks

Graph neural networks (GNNs) have gained significant interest for applications such as citation network analysis and drug discovery because they can apply machine learning techniques to graph-structured data. GNNs typically employ a two-stage execution pipeline consisting of combination and aggregation kernels. The combination stage performs data-intensive convolution operations with relatively regular memory access patterns, whereas the aggregation stage operates on sparse graph data with highly irregular accesses. This heterogeneous memory behavior makes conventional CPU- and GPU-based execution energy-inefficient because of substantial data movement overheads. Existing accelerators attempt to mitigate these challenges using specialized architectures and processing-in-memory (PIM) techniques. However, prior approaches often suffer from limited scalability, area overheads, restricted parallelism, and the energy inefficiencies of analog compute and dedicated accelerator structures. This paper presents NEM-GNN, a scalable, DAC/ADC-less processing-in-memory architecture for GNN acceleration. The proposed design introduces early compute termination mechanisms, pre-computation using reconfigurable system-on-chip components, and graph- and sparsity-aware near-memory aggregation based on a compute-as-soon-as-ready (CAR) and broadcast-based execution model. Experimental results show that NEM-GNN achieves approximately 80-230x higher performance, 80-300x higher throughput, 850-1134x better energy efficiency, and 7-8x higher compute density than prior state-of-the-art approaches.
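
To make the two-stage pipeline concrete, the following is a minimal software sketch of one generic GNN layer, with a dense combination step followed by a sparse aggregation step. It is not the NEM-GNN hardware design: the function names, the CSR adjacency layout, the plain-sum aggregator, the combination-then-aggregation order, and the toy graph are all assumptions chosen for illustration.

```python
# Illustrative sketch only: a generic two-stage GNN layer, not NEM-GNN itself.
# All names (combine, aggregate, row_ptr, col_idx) are hypothetical.

def combine(features, weights):
    """Combination: dense feature transform X @ W with regular, row-by-row accesses."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in features
    ]

def aggregate(transformed, row_ptr, col_idx):
    """Aggregation: sum neighbor rows via a CSR adjacency; accesses follow the graph."""
    width = len(transformed[0])
    out = []
    for v in range(len(row_ptr) - 1):
        acc = [0.0] * width
        for e in range(row_ptr[v], row_ptr[v + 1]):
            neighbor = transformed[col_idx[e]]  # irregular, data-dependent read
            for j in range(width):
                acc[j] += neighbor[j]
        out.append(acc)
    return out

# Toy graph with 3 nodes in CSR form: 0 -> {1, 2}, 1 -> {0}, 2 -> {}.
row_ptr = [0, 2, 3, 3]
col_idx = [1, 2, 0]
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
weights = [[1.0, 1.0], [0.0, 1.0]]  # 2 input features -> 2 output features

hidden = combine(features, weights)            # regular accesses
result = aggregate(hidden, row_ptr, col_idx)   # irregular accesses
```

In this sketch the combination loop walks the feature matrix in order, while the aggregation loop reads rows at indices dictated by `col_idx`, which is the kind of irregular access pattern the abstract attributes to the aggregation stage.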
