In many real-world scenarios (e.g., academic networks, social platforms), entities of different types are not only associated with texts but also connected by various relationships, which can be abstracted as Text-Attributed Heterogeneous Graphs (TAHGs). Current pretraining tasks for Language Models (LMs) primarily focus on learning the textual information of each entity separately and overlook the topological connections among entities in TAHGs. In this paper, we present a new pretraining framework for LMs that explicitly considers the topological and heterogeneous information in TAHGs. First, we define a context graph as the neighborhood of a target node up to a specified order and propose a topology-aware pretraining task that predicts which nodes belong to the context graph by jointly optimizing an LM and an auxiliary heterogeneous graph neural network. Second, based on the observation that some nodes are text-rich while others have little text, we devise a text augmentation strategy that enriches textless nodes with their neighbors' texts to handle this imbalance. We conduct link prediction and node classification experiments on three datasets from various domains. Experimental results show that our approach outperforms existing methods and support the contribution of each design choice. Our code is available at https://github.com/Hope-Rita/THLM.
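The two data-preparation ideas in the abstract, the context graph and the text augmentation, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that "order" means hop distance, represents the graph as a plain adjacency dictionary, and ignores node and edge types for brevity. The function names `context_graph` and `augment_text`, the thresholds `min_tokens` and `max_neighbors`, and the toy graph are all hypothetical.

```python
from collections import deque


def context_graph(adj, target, max_order):
    """Return {node: hop distance} for nodes within max_order hops of target.

    Assumption: "order" is read as hop distance, computed by breadth-first search.
    """
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if dist[node] == max_order:
            continue  # do not expand beyond the requested order
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def augment_text(adj, texts, node, min_tokens=5, max_neighbors=3):
    """Append neighbor texts to a node whose own text is short.

    Assumption: a node counts as text-poor if it has fewer than
    min_tokens whitespace-separated tokens (a hypothetical threshold).
    """
    own = texts.get(node, "")
    if len(own.split()) >= min_tokens:
        return own  # text-rich node: keep as-is
    neighbor_texts = [texts[n] for n in adj.get(node, ()) if texts.get(n)]
    return " ".join([own] + neighbor_texts[:max_neighbors]).strip()


# Toy academic graph (hypothetical data): papers, an author, and a venue.
adj = {
    "paper1": ["author1", "venue1"],
    "author1": ["paper1", "paper2"],
    "paper2": ["author1"],
    "venue1": ["paper1"],
}
texts = {
    "paper1": "Pretraining language models on text-attributed heterogeneous graphs",
    "paper2": "Graph neural networks for link prediction",
    "author1": "",
    "venue1": "KDD",
}

first_order = context_graph(adj, "paper1", 1)
second_order = context_graph(adj, "paper1", 2)
author_text = augment_text(adj, texts, "author1")
```

In the framework described above, the nodes returned by `context_graph` would serve as positive targets for the topology-aware prediction task, while the augmented text would be what the LM encodes for text-poor nodes. The joint optimization with the auxiliary heterogeneous graph neural network is outside the scope of this sketch.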