UniKG: A Benchmark and Universal Embedding for Large-Scale Knowledge Graphs

Irregular real-world data are usually organized as heterogeneous graphs (HGs) consisting of multiple types of nodes and edges. Mining useful knowledge from such data requires both large-scale encyclopedic HG datasets and effective learning methods for them, yet neither has been well investigated. In this paper, we construct a large-scale HG benchmark dataset named UniKG from Wikidata to facilitate knowledge mining and heterogeneous graph representation learning. Overall, UniKG contains more than 77 million multi-attribute entities and 2000 diverse association types, which significantly surpasses the scale of existing HG datasets. To learn effectively on the large-scale UniKG, we take two key measures: (i) a semantic alignment strategy for multi-attribute entities, which projects the feature descriptions of multi-attribute nodes into a common embedding space to facilitate node aggregation over a large receptive field; and (ii) a novel plug-and-play anisotropy propagation module (APM) that learns effective multi-hop anisotropy propagation kernels, extending methods for large-scale homogeneous graphs to heterogeneous graphs. Together, these two strategies enable efficient information propagation among a tremendous number of multi-attribute entities while adaptively mining multi-attribute associations through multi-hop aggregation in large-scale HGs. We set up a node classification task on UniKG and evaluate multiple baseline methods constructed by embedding our APM into large-scale homogeneous graph learning methods. The UniKG dataset and the baseline code have been released at https://github.com/Yide-Qiu/UniKG.
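
To make the two measures concrete, the following is a minimal illustrative sketch and not the released APM implementation. It assumes that semantic alignment can be approximated by one linear projection per node type, and that "anisotropic" propagation can be read as weighting each relation type differently when mixing neighbors. The relation weights are fixed constants here, whereas the paper describes learned kernels. All names (`align_features`, `multi_hop_propagate`, the toy relations `born_in` and `located_in`) are hypothetical.

```python
import numpy as np


def row_normalize(adj):
    """Divide each row by its out-degree; rows without edges stay all-zero."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    return adj / deg


def align_features(features_by_type, projections):
    """Project each node type's raw features into one shared embedding space."""
    return {t: x @ projections[t] for t, x in features_by_type.items()}


def multi_hop_propagate(x, adj_by_relation, relation_weights, num_hops):
    """Return [x, Kx, K^2 x, ...] where K mixes relation-specific adjacencies.

    Assumption: K is a weighted sum of row-normalized adjacency matrices,
    one per relation type, with fixed (not learned) weights.
    """
    kernel = sum(
        w * row_normalize(adj_by_relation[r]) for r, w in relation_weights.items()
    )
    hops = [x]
    for _ in range(num_hops):
        hops.append(kernel @ hops[-1])
    return hops


# Toy graph: nodes 0-1 are "person" (3 raw features), nodes 2-3 are "place" (2).
rng = np.random.default_rng(0)
features_by_type = {
    "person": rng.normal(size=(2, 3)),
    "place": rng.normal(size=(2, 2)),
}
projections = {
    "person": rng.normal(size=(3, 2)),  # maps 3-dim person features to 2 dims
    "place": rng.normal(size=(2, 2)),   # maps 2-dim place features to 2 dims
}
aligned = align_features(features_by_type, projections)
x = np.vstack([aligned["person"], aligned["place"]])  # shape (4, 2)

born_in = np.zeros((4, 4))
born_in[0, 2] = 1.0  # person 0 -> place 2
born_in[1, 3] = 1.0  # person 1 -> place 3
located_in = np.zeros((4, 4))
located_in[2, 3] = 1.0  # place 2 -> place 3

adj_by_relation = {"born_in": born_in, "located_in": located_in}
relation_weights = {"born_in": 0.7, "located_in": 0.3}

hops = multi_hop_propagate(x, adj_by_relation, relation_weights, num_hops=2)
```

Because the hop features depend only on the graph and the aligned inputs, they can be computed once before training, which is the property that lets precomputation-based methods for large homogeneous graphs scale. A downstream classifier would then consume the stacked entries of `hops`.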
