Large-scale graph neural network (GNN) training often requires distributed clusters because graph structure and feature tensors no longer fit in a single node's memory. In sampling-based training, each mini-batch expands into a receptive field that spans partitions and triggers thousands of remote feature fetches per epoch. This wastes energy for two main reasons: each small RPC pays a fixed initiation and protocol cost, and GPUs continue drawing substantial baseline power while waiting for remote features. We present GreenGNN, an energy-aware distributed GNN training system that reduces communication energy by exploiting the bursty, short-lived temporal locality of neighbor sampling. GreenGNN groups training into windows of W consecutive mini-batches, stages each window's hot features in a local cache, and merges the remote requests destined for each partition owner into a small number of bulk transfers. This amortizes RPC overhead across many features while preserving an on-demand path for cache misses. Because the window size controls the trade-off between communication amortization and hot-set staleness, GreenGNN selects W offline using a discrete-event simulator that replays a deterministic one-epoch access trace under a hybrid energy model. We implement GreenGNN on DGL and evaluate it on a 4-node GPU cluster with benchmark datasets. Across datasets and batch sizes, GreenGNN reduces total system energy by 27--43% relative to the baseline while improving end-to-end throughput by up to 3.9x. GPU energy drops by 36--71%, driven by fewer RPC initiations and lower GPU stall time.
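The windowing and request-merging idea can be illustrated with a minimal sketch. This is not GreenGNN's implementation: the function names (`plan_windows`, `count_rpcs`), the `min_hits` hotness threshold, the modulo partitioning, and the baseline assumption of one RPC per remote feature per mini-batch are all hypothetical choices made for illustration. The sketch looks ahead within each window, which is only possible when the access trace is known in advance, as with the deterministic trace the abstract describes.

```python
from collections import Counter, defaultdict


def plan_windows(batches, owner_of, local_part, window, min_hits=2):
    """Plan one bulk fetch per (window, owner) for hot remote nodes.

    Returns a list with one (bulk, on_demand) pair per window, where
    bulk maps owner -> sorted hot node ids and on_demand is the number
    of single-feature fetches left on the cache-miss path.
    All parameters are illustrative assumptions, not GreenGNN's API.
    """
    plans = []
    for start in range(0, len(batches), window):
        chunk = batches[start:start + window]
        # Count, per remote node, how many mini-batches in this window touch it.
        hits = Counter(
            node
            for batch in chunk
            for node in set(batch)
            if owner_of(node) != local_part
        )
        bulk = defaultdict(list)
        on_demand = 0
        for node, count in sorted(hits.items()):
            if count >= min_hits:
                # Hot: staged once per window via the owner's bulk transfer.
                bulk[owner_of(node)].append(node)
            else:
                # Cold: every access falls back to an on-demand fetch.
                on_demand += count
        plans.append((dict(bulk), on_demand))
    return plans


def count_rpcs(batches, owner_of, local_part, window, min_hits=2):
    """Compare RPC initiations: per-feature baseline vs. windowed plan."""
    baseline = sum(
        sum(1 for node in set(batch) if owner_of(node) != local_part)
        for batch in batches
    )
    plans = plan_windows(batches, owner_of, local_part, window, min_hits)
    windowed = sum(len(bulk) + on_demand for bulk, on_demand in plans)
    return baseline, windowed


# Toy trace: 4 partitions, nodes assigned by id modulo 4, trainer on partition 0.
toy_batches = [[1, 2, 5, 8], [1, 2, 6, 4], [1, 5, 9, 12], [2, 5, 13, 0]]
toy_owner = lambda node: node % 4
toy_plans = plan_windows(toy_batches, toy_owner, local_part=0, window=2)
toy_counts = count_rpcs(toy_batches, toy_owner, local_part=0, window=2)
print(toy_plans[0], toy_counts)
```

On this toy trace the windowed plan issues fewer RPC initiations than the per-feature baseline. A larger W merges more requests but also lets the staged hot set drift from what later mini-batches actually need, which is the trade-off the offline simulator is described as resolving; a realistic selector would score each candidate W with an energy model rather than a raw RPC count.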