Graph-structured data is prevalent in the real world. Recently, owing to their powerful emergent capabilities, Large Language Models (LLMs) have shown promising performance in modeling graphs. The key to effectively applying LLMs to graphs is converting graph data into a format LLMs can comprehend. Graph-to-token approaches are a popular way to enable LLMs to process graph information: they transform graphs into sequences of tokens and align them with text tokens through instruction tuning, where self-supervised instruction tuning helps LLMs acquire general knowledge about graphs, and supervised fine-tuning specializes LLMs for downstream graph tasks. Despite their initial success, we find that existing methods suffer from a misalignment between the self-supervised tasks and the supervised downstream tasks, resulting in negative transfer from self-supervised fine-tuning to the downstream tasks. To address this issue, we propose Graph Alignment Large Language Models (GALLM), which benefit from aligned task templates. In the self-supervised tuning stage, we introduce a novel text-matching task whose templates are aligned with the downstream tasks. In the task-specific tuning stage, we propose two category-prompt methods that learn supervision information from additional explanations through further aligned templates. Experimental evaluations on four datasets demonstrate substantial improvements in supervised learning, multi-dataset generalizability, and especially zero-shot capability, highlighting the model's potential as a graph foundation model.