We present our work on developing and training scalable graph foundation models (GFMs) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural networks (GNNs) in both training scale and data diversity. It abstracts over message-passing algorithms, allowing both reproduction of and comparison across the algorithmic innovations that define convolution in GNNs. This work discusses a series of optimizations that have allowed scaling GFM training to tens of thousands of GPUs on datasets consisting of hundreds of millions of graphs. Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures, such as total energy and atomic forces. Using over 150 million atomistic structures for training, we illustrate the performance of our approach, along with lessons learned, on two United States Department of Energy (US-DOE) supercomputers: the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge National Laboratory. The HydraGNN architecture enables the GFM to achieve near-linear strong-scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier. Hyperparameter optimization (HPO) was performed on over 64,000 GPUs on Frontier to select GFM architectures with high accuracy. Early stopping was applied to each GFM architecture for energy awareness in performing such an extreme-scale task. An ensemble of the highest-ranked GFM architectures was then trained to convergence to establish uncertainty quantification (UQ) capabilities via ensemble learning. Our contribution opens the door to rapidly developing, training, and deploying GFMs with large-scale computational resources to enable AI-accelerated materials discovery and design.
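To make the multi-headed MTL design concrete, the following is a minimal sketch of the idea described in the abstract: a shared message-passing trunk whose node embeddings feed one graph-level head (total energy) and one node-level head (per-atom forces). This is not the HydraGNN implementation; the framework (PyTorch Geometric), layer types, hidden sizes, and loss weights are all illustrative assumptions.

```python
# Sketch of a multi-headed GNN with a shared message-passing trunk.
# Assumptions: PyTorch Geometric-style data (x, edge_index, batch);
# GCNConv stands in for whichever convolution the trunk abstracts over.
import torch
from torch import nn
from torch_geometric.nn import GCNConv, global_add_pool


class MultiHeadedGNN(nn.Module):
    def __init__(self, node_feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Shared trunk: the abstracted message-passing stage.
        self.conv1 = GCNConv(node_feat_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        # Graph-level head: pooled node embeddings -> scalar total energy.
        self.energy_head = nn.Linear(hidden_dim, 1)
        # Node-level head: per-node embeddings -> 3D force vector per atom.
        self.force_head = nn.Linear(hidden_dim, 3)

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        energy = self.energy_head(global_add_pool(h, batch))  # [num_graphs, 1]
        forces = self.force_head(h)                           # [num_nodes, 3]
        return energy, forces


# MTL objective: a weighted sum of the per-head losses
# (the weights w_e, w_f are hypothetical here).
def mtl_loss(pred_e, true_e, pred_f, true_f, w_e=1.0, w_f=1.0):
    return (w_e * nn.functional.mse_loss(pred_e, true_e)
            + w_f * nn.functional.mse_loss(pred_f, true_f))
```

The key design point the sketch captures is that both prediction tasks share one set of convolutional parameters, so graph-level and node-level supervision jointly shape the learned atomistic representation; the abstract's "multi-headed" refers to the separate output heads attached to this shared trunk.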