Large-scale Graph Neural Networks (GNNs) are typically trained by sampling each vertex's neighbors up to a fixed distance. Because large input graphs are distributed across machines, training requires frequent, irregular communication that stalls forward progress. Moreover, the data fetched changes with the graph, its distribution, sampling and batch parameters, and caching policies. Consequently, any static prefetching method misses crucial opportunities to adapt to these dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, that autonomously prefetches remote nodes to minimize communication. Rudder's adaptive approach contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI in contemporary Large Language Models (LLMs) exhibits emergent properties such as In-Context Learning (ICL) for zero-shot tasks with logical multi-step reasoning, and we find this behavior well suited to adaptive control even under substantial undertraining. Evaluations on standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to a 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching) and an 82% improvement over static prefetching, while reducing communication by more than 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent.