Graph Neural Networks (GNNs) are widely used in recommendation systems, fraud detection, and node/link classification tasks. Real-world GNNs continue to grow in size, and the graphs and embeddings they must store often exceed the memory capacity of the GPUs used for training. To work around limited memory capacity, traditional GNN training approaches use graph partitioning and sharding to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, these approaches suffer from the high computational cost of graph partitioning algorithms and from inefficient communication across GPUs. To address these overheads, we propose the Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storage-based approach to training GNN models that uses a novel communication layer to let the per-GPU software caches function as a single system-wide shared cache with low overhead. LSM-GNN incorporates a hybrid eviction policy that manages cache space using both static and dynamic node information, significantly improving cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism that prefetches node feature data from a victim buffer located in CPU pinned memory to further reduce pressure on the storage devices. Experimental results show that, despite its lower compute capability and memory capacity, LSM-GNN on a single node with two GPUs outperforms a two-node, four-GPU Dist-DGL baseline, providing up to a 3.75x speedup in end-to-end epoch time for large-scale GNN training.
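To make the caching ideas above more concrete, the following is a minimal single-process sketch of a feature cache that combines a static per-node score with dynamic recency when choosing what to evict, and that parks evicted entries in a small victim buffer before falling back to storage. It is not the LSM-GNN implementation: the class name `HybridCache`, the `alpha` weighting, the use of normalized node degree as the static signal, and the FIFO victim buffer are all assumptions made for illustration, and the sketch models neither the multi-GPU communication layer nor the prefetching done by PVP.

```python
from collections import OrderedDict


class HybridCache:
    """Toy feature cache that evicts the entry with the lowest hybrid score.

    Hypothetical illustration only; names and scoring are assumptions.
    """

    def __init__(self, capacity, victim_capacity, static_score, alpha=0.5):
        self.capacity = capacity
        self.victim_capacity = victim_capacity
        self.static_score = static_score  # assumed: e.g. normalized node degree
        self.alpha = alpha                # assumed weight of static vs. dynamic term
        self.cache = {}                   # node_id -> feature
        self.last_used = {}               # node_id -> logical timestamp
        self.victims = OrderedDict()      # FIFO stand-in for a pinned-memory victim buffer
        self.clock = 0
        self.stats = {"cache": 0, "victim": 0, "storage": 0}

    def _score(self, node):
        # Higher score means the entry is more worth keeping.
        recency = self.last_used[node] / max(self.clock, 1)
        static = self.static_score.get(node, 0.0)
        return self.alpha * static + (1 - self.alpha) * recency

    def _insert(self, node, feat):
        if len(self.cache) >= self.capacity:
            # Evict the lowest-scoring entry into the victim buffer.
            evicted = min(self.cache, key=self._score)
            self.victims[evicted] = self.cache.pop(evicted)
            del self.last_used[evicted]
            if len(self.victims) > self.victim_capacity:
                self.victims.popitem(last=False)  # drop the oldest victim
        self.cache[node] = feat
        self.last_used[node] = self.clock

    def get(self, node, load_from_storage):
        """Look up a node feature: cache first, then victim buffer, then storage."""
        self.clock += 1
        if node in self.cache:
            self.stats["cache"] += 1
            self.last_used[node] = self.clock
            return self.cache[node]
        if node in self.victims:
            self.stats["victim"] += 1
            feat = self.victims.pop(node)
        else:
            self.stats["storage"] += 1
            feat = load_from_storage(node)
        self._insert(node, feat)
        return feat


if __name__ == "__main__":
    # Node 0 has a high static score, so it survives eviction even when it
    # is the least recently used entry.
    cache = HybridCache(capacity=2, victim_capacity=1,
                        static_score={0: 1.0, 1: 0.0, 2: 0.0})
    for node in [0, 1, 2, 1, 0]:
        cache.get(node, load_from_storage=lambda n: n * 10)
    print(cache.stats)
```

In this access trace a pure least-recently-used policy would evict node 0 when node 2 arrives; the static term keeps it resident instead, and the re-request for node 1 is served from the victim buffer rather than from storage. How the real system weights static against dynamic information is not specified in the text above, so the linear blend here should be read as a placeholder.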