Pre-training models on large text corpora has proven effective for a variety of downstream applications in NLP. In graph mining, an analogous approach is to pre-train graph models on large graphs in the hope of benefiting downstream graph applications, and several recent studies have explored it. However, no existing study has investigated pre-training combined text and graph models on large heterogeneous graphs with abundant textual information (a.k.a. large graph corpora) and then fine-tuning the model on different related downstream applications with different graph schemas. To address this problem, we propose a framework for graph-aware language model pre-training (GALM) on a large graph corpus, which incorporates large language models and graph neural networks, together with a variety of fine-tuning methods for downstream applications. We conduct extensive experiments on Amazon's real internal datasets and large public datasets. Comprehensive empirical results and in-depth analysis demonstrate the effectiveness of the proposed methods, along with lessons learned.