Graph analytics are vital in fields such as social networks, biomedical research, and graph neural networks (GNNs). However, traditional CPUs and GPUs struggle with the memory bottlenecks caused by large graph datasets and their fine-grained memory accesses. While specialized graph accelerators address these challenges, they often support only moderate-sized graphs (under 500 million edges). Our paper proposes Swift, a novel scale-up graph accelerator framework that processes large graphs by leveraging the flexibility of FPGA custom datapaths and memory resources, and optimizes utilization of high-bandwidth 3D memory (HBM). Swift supports up to 8 FPGAs in a node. Swift introduces a decoupled, asynchronous model based on the Gather-Apply-Scatter (GAS) scheme. It partitions the graph into subgraphs across FPGAs, and each subgraph into intervals based on source vertex IDs. Processing on these intervals is decoupled and executed asynchronously, rather than in bulk-synchronous fashion, where throughput is limited by the slowest task. This enables simultaneous processing within each multi-FPGA node and optimizes the utilization of communication (PCIe), off-chip (HBM), and on-chip BRAM/URAM resources. Swift demonstrates significant performance improvements compared to prior scalable FPGA-based frameworks, performing 12.8 times better than ForeGraph. Performance against Gunrock on NVIDIA A40 GPUs is mixed, because NVLink gives the GPU system a nearly 5x bandwidth advantage, but the FPGA system nevertheless achieves 2.6x greater energy efficiency.
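The two-level partitioning described above can be illustrated with a minimal Python sketch. It is not Swift's implementation: the function name `partition_edges`, its parameters, and the rule that assigns each edge to an FPGA by a contiguous range of destination vertex IDs are all assumptions made for illustration. Only the split of each subgraph into intervals by source vertex ID comes from the text.

```python
from collections import defaultdict


def partition_edges(edges, num_vertices, num_fpgas, interval_size):
    """Split edges into per-FPGA subgraphs, then into source-ID intervals.

    Assumption (not from the source): each FPGA owns a contiguous range of
    destination vertex IDs. The interval split by source vertex ID follows
    the description in the text.
    """
    # Ceiling division so that every vertex falls into some FPGA's range.
    verts_per_fpga = -(-num_vertices // num_fpgas)
    # One mapping per FPGA: interval index -> list of (src, dst) edges.
    subgraphs = [defaultdict(list) for _ in range(num_fpgas)]
    for src, dst in edges:
        fpga = dst // verts_per_fpga      # hypothetical FPGA assignment
        interval = src // interval_size   # interval chosen by source vertex ID
        subgraphs[fpga][interval].append((src, dst))
    return subgraphs
```

Calling `partition_edges(edges, num_vertices=8, num_fpgas=2, interval_size=4)` on a small edge list returns one mapping per FPGA. Each interval in such a mapping is an independent unit of work, which is what would allow intervals to be processed without a global barrier between them.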