Accelerating Generic Graph Neural Networks via Architecture, Compiler, Partition Method Co-Design

Graph neural networks (GNNs) have shown significant accuracy improvements in a variety of graph learning domains, sparking considerable research interest. To translate these accuracy improvements into practical applications, it is essential to develop high-performance and efficient hardware acceleration for GNN models. However, designing GNN accelerators faces two fundamental challenges: the high bandwidth requirement of GNN models and the diversity of GNN models. Previous works have addressed the first challenge by using more expensive memory interfaces to achieve higher bandwidth. For the second challenge, existing works either support specific GNN models or have generic designs with poor hardware utilization. In this work, we tackle both challenges simultaneously. First, we identify a new type of partition-level operator fusion, which we utilize to internally reduce the high bandwidth requirement of GNNs. Next, we introduce partition-level multi-threading to schedule the concurrent processing of graph partitions, utilizing different hardware resources. To further reduce the extra on-chip memory required by multi-threading, we propose fine-grained graph partitioning to generate denser graph partitions. Importantly, these three methods make no assumptions about the targeted GNN models, addressing the challenge of model variety. We implement these methods in a framework called SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware accelerator. Our evaluation demonstrates that SwitchBlade achieves an average speedup of $1.85\times$ and energy savings of $19.03\times$ compared to the NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to state-of-the-art specialized accelerators.
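The bandwidth argument behind partition-level operator fusion can be illustrated with a toy sketch. The code below is not SwitchBlade's implementation: the function names (`aggregate`, `run_unfused`, `run_fused`), the two-operator model (neighbor sum followed by ReLU), and the "traffic" counter are all assumptions made for illustration. It only shows that running consecutive operators one partition at a time lets the intermediate result stay in a small local buffer instead of being written out and read back for the whole graph.

```python
def aggregate(edges, feats, dst_nodes):
    """Sum source features over every edge whose destination is in dst_nodes."""
    dim = len(next(iter(feats.values())))
    out = {v: [0.0] * dim for v in dst_nodes}
    for src, dst in edges:
        if dst in out:
            out[dst] = [a + b for a, b in zip(out[dst], feats[src])]
    return out


def relu(vec):
    """Element-wise ReLU on one feature vector."""
    return [max(0.0, x) for x in vec]


def run_unfused(edges, feats, nodes):
    """Operator-by-operator: the whole intermediate goes off-chip and back."""
    traffic = 0
    agg = aggregate(edges, feats, nodes)  # operator 1 over the whole graph
    traffic += len(agg)                   # toy proxy: one write per vertex
    out = {}
    for v in nodes:                       # operator 2 over the whole graph
        traffic += 1                      # toy proxy: one read-back per vertex
        out[v] = relu(agg[v])
    return out, traffic


def run_fused(edges, feats, partitions):
    """Partition-by-partition: both operators run before moving on."""
    traffic = 0                           # intermediate never leaves the buffer
    out = {}
    for part in partitions:
        local = aggregate(edges, feats, part)  # held in a partition-sized buffer
        for v in part:
            out[v] = relu(local[v])
    return out, traffic


# Hypothetical 4-vertex graph with 2-element features.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
feats = {0: [1.0, -2.0], 1: [-3.0, 4.0], 2: [5.0, -6.0], 3: [-1.0, -1.0]}
nodes = [0, 1, 2, 3]
partitions = [[0, 1], [2, 3]]  # partitioned by destination vertex (assumption)

unfused_out, unfused_traffic = run_unfused(edges, feats, nodes)
fused_out, fused_traffic = run_fused(edges, feats, partitions)
print(unfused_out == fused_out, unfused_traffic, fused_traffic)
```

Both schedules compute the same output; only the counted intermediate traffic differs. The counter ignores input feature reads and final output writes, which a real accelerator would also have to account for.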
