Task-Oriented GNNs Training on Large Knowledge Graphs for Accurate and Efficient Modeling

A Knowledge Graph (KG) is a heterogeneous graph encompassing a diverse range of node and edge types. Heterogeneous Graph Neural Networks (HGNNs) are popular for training machine learning tasks such as node classification and link prediction on KGs. However, HGNN methods exhibit excessive complexity, which is influenced by the KG's size, density, and number of node and edge types. To cope with this, AI practitioners handcraft a subgraph of a KG G that is relevant to a specific task. We refer to this subgraph as a task-oriented subgraph (TOSG); it contains a subset of the task-related node and edge types in G. Training the task on the TOSG instead of G alleviates the excessive computation required for a large KG. Crafting the TOSG, however, demands a deep understanding of both the KG's structure and the task's objectives, which makes it challenging and time-consuming. This paper proposes KG-TOSA, an approach that automates TOSG extraction for task-oriented HGNN training on a large KG. In KG-TOSA, we define a generic graph pattern that captures the KG's local and global structure relevant to a specific task. We explore different techniques for extracting subgraphs that match our graph pattern, namely (i) two techniques that sample around targeted nodes using biased random walks or influence scores, and (ii) a SPARQL-based extraction method that leverages RDF engines' built-in indices. The SPARQL-based method therefore incurs negligible preprocessing overhead compared to the sampling techniques. We develop a benchmark of large real KGs and various tasks for node classification and link prediction. Our experiments show that KG-TOSA helps state-of-the-art HGNN methods reduce training time and memory usage by up to 70% while improving model performance, e.g., accuracy and inference time.
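To make the idea of a task-oriented subgraph concrete, the following minimal sketch keeps only the triples that lie within a fixed number of hops of the task's target nodes. It is an illustration under stated assumptions, not the KG-TOSA graph pattern or implementation: the function name `extract_tosg`, the `hops` and `bidirectional` parameters, and the toy triples are all hypothetical, and the KG is assumed to fit in memory as a list of (subject, predicate, object) triples.

```python
from collections import defaultdict


def extract_tosg(triples, target_nodes, hops=1, bidirectional=True):
    """Return the set of triples within `hops` edges of the target nodes.

    Hypothetical helper for illustration only; the hop-bounded neighbourhood
    used here is an assumed stand-in for a task-specific graph pattern.
    """
    # Index triples by subject and by object so that each expansion step
    # touches only the edges incident to the current frontier.
    out_edges = defaultdict(list)
    in_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((s, p, o))
        in_edges[o].append((s, p, o))

    frontier = set(target_nodes)
    visited = set(target_nodes)
    selected = set()
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            edges = list(out_edges[node])
            if bidirectional:
                # Also follow edges that point *into* the current node.
                edges += in_edges[node]
            for s, p, o in edges:
                selected.add((s, p, o))
                neighbor = o if s == node else s
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return selected


# Toy KG (made-up data): the task targets the node "paper1".
kg = [
    ("paper1", "hasAuthor", "alice"),
    ("paper1", "publishedIn", "venueA"),
    ("alice", "affiliatedWith", "uniX"),
    ("paper2", "cites", "paper1"),
    ("movie1", "directedBy", "bob"),  # unrelated to the task
]
tosg = extract_tosg(kg, {"paper1"}, hops=1)
```

With one hop in both directions, the sketch keeps the three triples touching `paper1` and drops the unrelated `movie1` triple. Raising `hops` trades a larger subgraph for more context around the targets.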
