This work develops a distributed graph neural network (GNN) methodology for mesh-based modeling applications using a consistent neural message passing layer. As the name implies, the focus is on enabling scalable operations that satisfy physical consistency, achieved via halo nodes at sub-graph boundaries. Here, consistency means that the evaluation of a GNN trained on one rank (one large graph) is arithmetically equivalent to its evaluation on multiple ranks (a partitioned graph). This concept is demonstrated by interfacing GNNs with NekRS, a GPU-capable exascale CFD solver developed at Argonne National Laboratory. It is shown how the NekRS mesh partitioning can be linked to the distributed GNN training and inference routines, resulting in a scalable mesh-based data-driven modeling workflow. We study the impact of consistency on the scalability of mesh-based GNNs, demonstrating efficient scaling of consistent GNNs up to O(1B) graph nodes on the Frontier exascale supercomputer.
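To make the consistency property concrete, the following is a minimal single-process sketch, not the paper's implementation, of one sum-aggregation message passing step on a toy graph partitioned across two simulated ranks. Halo nodes hold copies of off-rank neighbor features; in an actual distributed run these copies would arrive via a halo exchange (e.g., MPI or torch.distributed point-to-point calls), which the explicit feature copies below stand in for. The function name `message_pass` and the edge lists are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Tiny global graph: 4 nodes, directed edges src -> dst (messages flow src to dst).
edges = torch.tensor([[0, 1], [1, 0], [1, 2], [2, 1], [2, 3], [3, 2]]).T  # [2, E]
x = torch.randn(4, 8)  # node features
W = torch.randn(8, 8)  # shared message-passing weight

def message_pass(x, edge_index, num_nodes):
    """One sum-aggregation step: h_i = sum over edges j->i of (x_j @ W)."""
    src, dst = edge_index
    msgs = x[src] @ W
    out = torch.zeros(num_nodes, x.shape[1])
    out.index_add_(0, dst, msgs)
    return out

# Single-rank reference evaluation on the full graph.
h_ref = message_pass(x, edges, num_nodes=4)

# Partition: rank 0 owns nodes {0, 1}; rank 1 owns nodes {2, 3}.
# Each rank keeps a halo copy of the off-rank in-neighbors of its boundary
# nodes, so every in-neighbor of an owned node is available locally.
# Rank 0 local ids: 0 -> 0, 1 -> 1, halo(2) -> 2.
edges_r0 = torch.tensor([[0, 1], [1, 0], [2, 1]]).T   # halo node sends to node 1
x_r0 = torch.stack([x[0], x[1], x[2]])                # halo feature "exchanged" in

# Rank 1 local ids: 2 -> 0, 3 -> 1, halo(1) -> 2.
edges_r1 = torch.tensor([[0, 1], [1, 0], [2, 0]]).T
x_r1 = torch.stack([x[2], x[3], x[1]])

h_r0 = message_pass(x_r0, edges_r0, num_nodes=3)[:2]  # keep owned nodes only
h_r1 = message_pass(x_r1, edges_r1, num_nodes=3)[:2]

# Consistency: the partitioned evaluation matches the single-rank evaluation.
assert torch.allclose(torch.cat([h_r0, h_r1]), h_ref, atol=1e-6)
print("partitioned evaluation is arithmetically consistent with one rank")
```

The final assertion is the consistency property stated above: the concatenated per-rank outputs for owned nodes reproduce the single-graph evaluation, because the halo copies give every boundary node the same in-neighbor features it would see on one rank.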