Graph neural networks are typically trained on individual datasets, often requiring highly specialized models and extensive hyperparameter tuning. This dataset-specific approach arises because each graph dataset often has unique node features and diverse connectivity structures, making it difficult to build a generalist model. To address these challenges, we introduce a scalable multi-graph multi-task pretraining approach specifically tailored for node classification tasks across diverse graph datasets from different domains. Our method, Graph Foundation Model (GraphFM), leverages a Perceiver-based encoder that employs learned latent tokens to compress domain-specific features into a common latent space. This approach enhances the model's ability to generalize across different graphs and allows for scaling across diverse data. We demonstrate the efficacy of our approach by training a model on 152 different graph datasets comprising over 7.4 million nodes and 189 million edges, establishing the first set of scaling laws for multi-graph pretraining on datasets spanning many domains (e.g., molecules, citation and product graphs). Our results show that pretraining on a diverse array of real and synthetic graphs improves the model's adaptability and stability, while performing competitively with state-of-the-art specialist models. This work illustrates that multi-graph pretraining can significantly reduce the burden imposed by the current graph training paradigm, unlocking new capabilities for the field of graph neural networks by creating a single generalist model that performs competitively across a wide range of datasets and tasks.
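To make the compression mechanism concrete, below is a minimal sketch of a Perceiver-style encoder with learned latent tokens, assuming PyTorch. The latent width (256), number of latent tokens (64), head count, and the per-dataset linear projections are illustrative assumptions, not GraphFM's actual hyperparameters or architectural details.

```python
import torch
import torch.nn as nn

class PerceiverGraphEncoder(nn.Module):
    """Sketch of a Perceiver-style encoder: a fixed set of learned latent
    tokens cross-attends to node tokens, compressing graphs of any size and
    feature dimensionality into a shared, fixed-size latent space."""

    def __init__(self, latent_dim=256, num_latents=64, num_heads=8):
        super().__init__()
        # Learned latent tokens, shared across all datasets (sizes are assumptions).
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        # Cross-attention: latents are queries; node tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, node_tokens):
        # node_tokens: (batch, num_nodes, latent_dim), already mapped from
        # dataset-specific features by a per-dataset linear projection.
        b = node_tokens.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        z, _ = self.cross_attn(q, node_tokens, node_tokens)
        # Residual + norm; output size is fixed regardless of num_nodes.
        return self.norm(q + z)  # (batch, num_latents, latent_dim)

# Hypothetical usage: per-dataset projections map heterogeneous node features
# to a common width, while the encoder and its latents are shared.
encoder = PerceiverGraphEncoder()
proj_citation = nn.Linear(500, 256)   # e.g. bag-of-words node features
proj_molecule = nn.Linear(9, 256)     # e.g. atom-type node features
z_cit = encoder(proj_citation(torch.randn(1, 2708, 500)))
z_mol = encoder(proj_molecule(torch.randn(1, 30, 9)))
print(z_cit.shape, z_mol.shape)  # both torch.Size([1, 64, 256])
```

Because the latent set has a fixed size, everything downstream of the cross-attention costs the same regardless of the input graph's node count or feature dimensionality, which is what lets one generalist model ingest datasets of very different sizes and domains.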