Graph Neural Networks (GNNs) play a crucial role in many fields. However, most existing deep graph learning frameworks assume pre-stored static graphs and do not support training on graph streams, even though many real-world graphs are dynamic and carry temporal information. We introduce GNNFlow, a distributed framework for efficient continuous temporal graph representation learning on dynamic graphs, running on multi-GPU machines. GNNFlow introduces an adaptive, time-indexed, block-based data structure that balances memory usage against the efficiency of graph updates and sampling operations. It uses a hybrid GPU-CPU graph data placement for fast GPU-based temporal neighborhood sampling, together with kernel optimizations that accelerate sampling. A dynamic GPU cache for node and edge features maximizes cache hit rates through reuse and restoration strategies. GNNFlow supports distributed training across multiple machines and uses static scheduling to ensure load balance. We implement GNNFlow on top of DGL and PyTorch. Our experimental results show that GNNFlow provides up to 21.1x faster continuous learning than existing systems.
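To make the idea of a time-indexed, block-based structure with temporal neighborhood sampling more concrete, the following is a minimal CPU-only sketch in Python. It is not GNNFlow's implementation: the names `EdgeBlock`, `TemporalBlockGraph`, `add_edge`, and `sample_recent` are hypothetical, the block capacity is fixed rather than adaptive, and there is no GPU placement or caching. It only illustrates how storing each node's edges in time-ordered blocks lets a sampler skip whole blocks that are newer than the query time.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class EdgeBlock:
    """Fixed-capacity block of one node's edges, kept in timestamp order.

    Hypothetical layout for illustration; GNNFlow's blocks are adaptive.
    """
    capacity: int
    neighbors: list = field(default_factory=list)
    timestamps: list = field(default_factory=list)

    def full(self):
        return len(self.timestamps) >= self.capacity

    def start_time(self):
        # Oldest timestamp in the block; serves as the block's time index.
        return self.timestamps[0]


class TemporalBlockGraph:
    """Per-node chain of edge blocks, oldest block first."""

    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity
        self.blocks = {}  # node id -> list of EdgeBlock

    def add_edge(self, src, dst, ts):
        # Assumption: edges of a given source arrive in non-decreasing
        # timestamp order, as in an append-only graph stream.
        chain = self.blocks.setdefault(src, [])
        if not chain or chain[-1].full():
            chain.append(EdgeBlock(self.block_capacity))
        chain[-1].neighbors.append(dst)
        chain[-1].timestamps.append(ts)

    def sample_recent(self, node, before, k):
        """Return up to k most recent (neighbor, ts) pairs with ts < before."""
        out = []
        if k <= 0:
            return out
        for block in reversed(self.blocks.get(node, [])):
            if block.start_time() >= before:
                continue  # every edge in this block is too new
            # First index whose timestamp is >= before; earlier ones qualify.
            end = bisect_left(block.timestamps, before)
            for i in range(end - 1, -1, -1):
                out.append((block.neighbors[i], block.timestamps[i]))
                if len(out) == k:
                    return out
        return out


g = TemporalBlockGraph(block_capacity=2)
for t, dst in enumerate([10, 11, 12, 13, 14]):
    g.add_edge(0, dst, float(t))
recent = g.sample_recent(0, before=3.0, k=2)
```

In this sketch the block capacity trades memory against update and scan cost: larger blocks waste more reserved slots for low-degree nodes, while smaller blocks mean longer chains to traverse during sampling. That is the kind of tension an adaptive block size is meant to address.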