Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of their backbone model, the graph transformer, are typically limited to single-GPU systems, leading to long training times or out-of-memory errors on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, because efficiency depends heavily on both the graph structure and system characteristics such as bandwidth and memory capacity. In this work, we introduce a distributed training framework for graph transformers that automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks, the framework achieves up to 6x speedup when scaling up to 8 GPUs. These results demonstrate that the proposed framework improves the scalability of graph transformers, bringing them closer to serving as practical graph foundation models.
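For readers unfamiliar with the sparse graph attention mentioned above, the following is a minimal single-process sketch in NumPy, not the framework's implementation. It assumes attention is restricted to the edges of the graph, so scores are stored per edge (memory proportional to the number of edges) rather than as a dense node-by-node matrix. The function name `sparse_graph_attention` and the edge-list arguments `src` and `dst` are hypothetical names chosen for illustration.

```python
import numpy as np

def sparse_graph_attention(q, k, v, src, dst):
    """Attention restricted to edges src -> dst (illustrative sketch only).

    q, k: (N, d) arrays; v: (N, d_v) array; src, dst: (E,) integer arrays.
    Returns the (N, d_v) output and the (E,) per-edge attention weights.
    """
    n, d = q.shape
    # One score per edge: the destination node queries its source neighbour.
    scores = np.einsum("ed,ed->e", q[dst], k[src]) / np.sqrt(d)

    # Numerically stable softmax over the incoming edges of each destination.
    max_per_dst = np.full(n, -np.inf)
    np.maximum.at(max_per_dst, dst, scores)
    exp_scores = np.exp(scores - max_per_dst[dst])
    denom = np.zeros(n)
    np.add.at(denom, dst, exp_scores)
    alpha = exp_scores / denom[dst]

    # Scatter-add the weighted source values into the destination rows.
    out = np.zeros(v.shape, dtype=float)
    np.add.at(out, dst, alpha[:, None] * v[src])
    return out, alpha

# Small usage example on a 5-node graph with 7 directed edges (made-up data).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 4)) for _ in range(3))
src = np.array([0, 1, 2, 3, 4, 0, 2])
dst = np.array([1, 2, 3, 4, 0, 0, 2])
out, alpha = sparse_graph_attention(q, k, v, src, dst)
```

In a distributed setting, the gather (`q[dst]`, `k[src]`) and scatter (`np.add.at`) steps are where edges that cross partitions would require communication, which is one reason efficiency depends on the graph structure and interconnect bandwidth.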