While great success has been achieved in building vision models with Contrastive Language-Image Pre-training (CLIP) over Internet-scale image-text pairs, building transferable Graph Neural Networks (GNNs) with a CLIP pipeline is challenging because of three fundamental issues: the scarcity of labeled data and text supervision, the different levels of downstream tasks, and the conceptual gaps between domains. In this work, to address these issues, we leverage multi-modal prompt learning to effectively adapt a pre-trained GNN to downstream tasks and data, given only a few semantically labeled samples, each with extremely weak text supervision. Our new paradigm embeds graphs directly in the same space as Large Language Models (LLMs) by learning graph prompts and text prompts simultaneously. To accomplish this, we improve the state-of-the-art graph prompt method, and then propose the first graph-language multi-modal prompt learning approach for exploiting the knowledge in pre-trained models. Notably, because the supervision is insufficient for fine-tuning, our paradigm keeps the pre-trained GNN and the LLM frozen, so the number of learnable parameters is much smaller than in fine-tuning any pre-trained model. Through extensive experiments on real-world datasets, we demonstrate the superior performance of our paradigm in few-shot, multi-task-level, and cross-domain settings. Moreover, we build the first CLIP-style zero-shot classification prototype that can generalize GNNs to unseen classes with extremely weak text supervision.
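To make the CLIP-style zero-shot step concrete, the following is a minimal sketch of how a graph embedding could be matched against class-name text embeddings once both lie in a shared space. It is an illustration under stated assumptions, not the method of this work: the function name `zero_shot_classify`, the toy embeddings, and the temperature value are all hypothetical, and the vectors stand in for the outputs of a frozen, prompted GNN and a frozen LLM text encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(graph_emb, class_text_embs, temperature=0.07):
    """Pick the class whose text embedding is closest to the graph embedding.

    graph_emb: stand-in for the output of a frozen GNN with graph prompts.
    class_text_embs: dict mapping class name -> stand-in for the frozen LLM's
        embedding of that class name (with text prompts).
    temperature: assumed softmax temperature; purely illustrative.
    """
    labels = list(class_text_embs)
    logits = [cosine(graph_emb, class_text_embs[c]) / temperature for c in labels]
    # Numerically stable softmax over the similarity logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = {c: e / total for c, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Toy usage with made-up 3-dimensional embeddings.
graph_emb = [0.9, 0.1, 0.0]
class_text_embs = {"biology": [1.0, 0.0, 0.0], "physics": [0.0, 1.0, 0.0]}
label, probs = zero_shot_classify(graph_emb, class_text_embs)
print(label, probs)
```

Because classification reduces to a similarity lookup against text embeddings, an unseen class can be handled by adding one more entry to `class_text_embs`, without retraining either encoder. In this sketch only the prompts would carry learnable parameters.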