When to Pre-Train Graph Neural Networks? From Data Generation Perspective!

In recent years, graph pre-training has gained significant attention; it aims to acquire transferable knowledge from unlabeled graph data to improve downstream performance. Despite these efforts, negative transfer remains a major concern when applying graph pre-trained models to downstream tasks. Previous studies have focused on what to pre-train and how to pre-train, designing a variety of graph pre-training and fine-tuning strategies. However, there are cases where even the most advanced "pre-train and fine-tune" paradigms fail to yield clear benefits. This paper introduces a generic framework, W2PGNN, to answer the crucial question of when to pre-train (i.e., in what situations graph pre-training is advantageous) before performing costly pre-training or fine-tuning. We take a new perspective and explore the complex generative mechanisms that lead from the pre-training data to the downstream data. In particular, W2PGNN first fits the pre-training data into graphon bases; each element of a graphon basis (i.e., a graphon) identifies a fundamental transferable pattern shared by a collection of pre-training graphs. All convex combinations of the graphon bases give rise to a generator space, and the graphs generated from it form the solution space of downstream data that can benefit from pre-training. In this manner, the feasibility of pre-training can be quantified as the generation probability of the downstream data from any generator in the generator space. W2PGNN offers three broad applications: providing the application scope of graph pre-trained models, quantifying the feasibility of pre-training, and assisting in the selection of pre-training data to enhance downstream performance. We provide a theoretically sound solution for the first application and extensive empirical justification for the latter two.
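To make the generator-space idea concrete, the following is a minimal sketch and not the W2PGNN implementation. It assumes graphons are approximated by K x K step functions, uses a crude degree-sort estimator, and swaps in a plain L2 distance between step functions as a stand-in for generation probability (a smaller distance to the closest convex combination is read as higher feasibility). The names `sample_graph`, `estimate_graphon`, `mix`, and `feasibility` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sample_graph(graphon, n, rng):
    """Sample an n-node undirected graph from a K x K step-function graphon."""
    k = graphon.shape[0]
    u = rng.uniform(size=n)                        # latent node positions in [0, 1)
    idx = np.minimum((u * k).astype(int), k - 1)   # block index of each node
    probs = graphon[np.ix_(idx, idx)]              # pairwise edge probabilities
    upper = np.triu(rng.uniform(size=(n, n)) < probs, k=1)
    return (upper | upper.T).astype(float)         # symmetric, no self-loops


def estimate_graphon(adj, k):
    """Crude estimate (an assumption): sort nodes by degree, average k x k blocks."""
    order = np.argsort(-adj.sum(axis=1), kind="stable")
    a = adj[np.ix_(order, order)]
    groups = np.array_split(np.arange(a.shape[0]), k)
    return np.array([[a[np.ix_(gi, gj)].mean() for gj in groups] for gi in groups])


def mix(bases, weights):
    """Convex combination of graphon bases: one generator in the generator space."""
    return np.tensordot(weights, bases, axes=1)


def feasibility(bases, target, n_candidates=2000, seed=0):
    """Proxy score: minus the smallest L2 distance from target to a sampled mixture.

    Mixture weights are drawn from a Dirichlet distribution (plus the pure bases),
    so every candidate lies on the probability simplex.
    """
    rng = np.random.default_rng(seed)
    m = bases.shape[0]
    cand = np.vstack([np.eye(m), rng.dirichlet(np.ones(m), size=n_candidates)])
    dists = [np.linalg.norm(mix(bases, w) - target) for w in cand]
    best = int(np.argmin(dists))
    return -dists[best], cand[best]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy "pre-training" graphons: a sparse and a dense constant pattern.
    bases = np.stack([np.full((2, 2), 0.2), np.full((2, 2), 0.8)])
    # A downstream graph drawn from a generator inside the space.
    downstream = sample_graph(mix(bases, np.array([0.3, 0.7])), n=300, rng=rng)
    score, weights = feasibility(bases, estimate_graphon(downstream, k=2))
    print("feasibility proxy:", score, "best weights:", weights)
```

Under these assumptions, a downstream graphon that lies inside the generator space scores close to zero, while one with structure that no convex combination can reproduce (for example, strong community blocks when every basis is constant) scores markedly lower.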
