Graph neural networks (GNNs) have gained significant interest for applications such as citation network analysis and drug discovery because they apply machine learning techniques to graph-structured data. GNNs typically employ a two-stage execution pipeline consisting of combination and aggregation kernels. The combination stage performs data-intensive convolution operations with relatively regular memory access patterns, whereas the aggregation stage operates on sparse graph data with highly irregular accesses. These heterogeneous memory behaviors make conventional CPU- and GPU-based execution energy-inefficient because of substantial data movement overheads. Existing accelerators attempt to mitigate these challenges using specialized architectures and processing-in-memory (PIM) techniques. However, prior approaches often suffer from limited scalability, area overheads, restricted parallelism, and the energy inefficiencies associated with analog compute and dedicated accelerator structures. This paper presents NEM-GNN, a scalable DAC/ADC-less processing-in-memory architecture for graph neural network acceleration. The proposed design introduces early compute termination mechanisms, pre-computation using reconfigurable system-on-chip components, and graph- and sparsity-aware near-memory aggregation using a compute-as-soon-as-ready (CAR) and broadcast-based execution model. Experimental results demonstrate that NEM-GNN achieves approximately 80--230x higher performance, 80--300x higher throughput, 850--1134x better energy efficiency, and 7--8x higher compute density compared to prior state-of-the-art approaches.
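The contrast between the two kernels named in the abstract can be illustrated with a minimal sketch of a generic GCN-style layer. This is not the NEM-GNN design: the function names `combine` and `aggregate`, the CSR arrays, the 3-node graph, and the combination-then-aggregation ordering are all assumptions chosen for illustration. Combination walks the feature matrix row by row, while aggregation gathers rows whose indices depend on the graph data.

```python
# Illustrative sketch only: a generic GCN-style layer, not the NEM-GNN architecture.
# All names and data below are hypothetical.

def combine(features, weights):
    """Combination: dense feature-by-weight product (regular, row-by-row access)."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in features
    ]


def aggregate(row_ptr, col_idx, features):
    """Aggregation: sum neighbor features over a CSR adjacency (irregular gathers)."""
    width = len(features[0])
    out = []
    for v in range(len(row_ptr) - 1):
        acc = [0.0] * width
        for e in range(row_ptr[v], row_ptr[v + 1]):
            u = col_idx[e]  # neighbor index is data-dependent, so accesses are irregular
            for k in range(width):
                acc[k] += features[u][k]
        out.append(acc)
    return out


# Hypothetical 3-node graph in CSR form: 0 -> {1, 2}, 1 -> {0}, 2 -> {0, 1}
row_ptr = [0, 2, 3, 5]
col_idx = [1, 2, 0, 0, 1]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 2.0], [3.0, 4.0]]  # 2 input features -> 2 output features

combined = combine(features, weights)
aggregated = aggregate(row_ptr, col_idx, combined)
print(combined)    # [[1.0, 2.0], [3.0, 4.0], [4.0, 6.0]]
print(aggregated)  # [[7.0, 10.0], [1.0, 2.0], [4.0, 6.0]]
```

In this sketch the inner gather `features[u]` is where the irregular memory traffic arises, since `u` is only known after reading `col_idx`.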