While Large Language Models (LLMs) have shown exceptional generalization capabilities, their ability to process graph data, such as molecular structures, remains limited. To bridge this gap, this paper proposes Graph2Token, an efficient solution that aligns graph tokens to LLM tokens. The key idea is to represent a graph token with the LLM token vocabulary, without fine-tuning the LLM backbone. To achieve this goal, we first construct a molecule-text paired dataset from multiple sources, including ChEBI and HMDB, to train a graph structure encoder, which reduces the distance between graph and text representations in the feature space. Then, we propose a novel alignment strategy that associates a graph token with LLM tokens. To further unleash the potential of LLMs, we collect molecular IUPAC name identifiers, which are incorporated into the LLM prompts. By aligning molecular graphs as special tokens, we can activate the generalization ability of LLMs for molecular few-shot learning. Extensive experiments on molecular classification and regression tasks demonstrate the effectiveness of our proposed Graph2Token.