GraphFM: A generalist graph transformer that learns transferable representations across diverse domains

Graph neural networks (GNNs) are often trained on individual datasets, requiring specialized models and significant hyperparameter tuning due to the unique structures and features of each dataset. This approach limits the scalability and generalizability of GNNs, as models must be tailored for each specific graph type. To address these challenges, we introduce GraphFM, a scalable multi-graph pretraining approach designed for learning across diverse graph datasets. GraphFM uses a Perceiver-based encoder with learned latent tokens to compress domain-specific features into a shared latent space, enabling generalization across graph domains. We propose new techniques for scaling up graph training on datasets of different sizes, allowing us to train GraphFM on 152 distinct graph datasets, containing a total of 7.4 million nodes and 189 million edges. This allows us to study the effect of scale on pretraining across domains such as molecules, citation networks, and product graphs, and show that training on diverse datasets improves performance over single-source pretraining. Additionally, pretraining with a mixture of synthetic and real graphs enhances adaptability and stability, leading to competitive performance with state-of-the-art models across various node classification tasks. This approach reduces the burden of dataset-specific training and provides a single generalist model capable of performing across multiple diverse graph structures and tasks. Code is available at https://github.com/nerdslab/GraphFM.
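The Perceiver-style compression step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the GraphFM implementation: the function name `perceiver_compress`, the per-dataset input projections, the single attention head, and all dimensions are assumptions chosen for illustration. It only shows the general idea that a fixed set of learned latent tokens cross-attends to node features, so graphs with different sizes and feature widths map to latent arrays of the same shape.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_compress(node_feats, in_proj, latents, w_q, w_k, w_v):
    """Single-head cross-attention from latent tokens to node tokens.

    node_feats: (N, d_in) features of one graph (d_in is dataset-specific)
    in_proj:    (d_in, d) hypothetical dataset-specific input projection
    latents:    (K, d)    latent tokens shared across datasets
    Returns a (K, d) array whose shape does not depend on N or d_in.
    """
    tokens = node_feats @ in_proj            # (N, d)
    q = latents @ w_q                        # (K, d) queries come from latents
    k = tokens @ w_k                         # (N, d) keys come from nodes
    v = tokens @ w_v                         # (N, d) values come from nodes
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (K, N)
    attn = softmax(scores, axis=-1)          # each latent attends over all nodes
    return attn @ v                          # (K, d)

rng = np.random.default_rng(0)
d, num_latents = 16, 8                       # illustrative sizes only
latents = rng.normal(size=(num_latents, d))  # would be learned in practice
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Two toy "datasets" with different node counts and feature widths.
graph_a = rng.normal(size=(100, 32))
graph_b = rng.normal(size=(2500, 7))
proj_a = rng.normal(size=(32, d)) / np.sqrt(32)
proj_b = rng.normal(size=(7, d)) / np.sqrt(7)

z_a = perceiver_compress(graph_a, proj_a, latents, w_q, w_k, w_v)
z_b = perceiver_compress(graph_b, proj_b, latents, w_q, w_k, w_v)
print(z_a.shape, z_b.shape)
```

Because the queries come from the latent tokens rather than from the nodes, the cost of this step grows linearly with the number of nodes, and everything downstream operates on a fixed-size representation.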

