Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored to individual TAGs and fail to generalize across diverse graph scenarios. Given that TAGs share a common textual space, jointly fine-tuning over multiple TAGs, which aligns text and graph structure from different perspectives, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs of different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes, and a lazy contrastive module devised to accelerate training by minimizing repetitive encoding computations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM's efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in-domain and out-of-domain scenarios). The code is available at https://github.com/NYUSHCS/UniGLM.