Wikipedia articles are hierarchically organized through categories and lists, which together form one of the most comprehensive and universal taxonomies, but their open creation process causes redundancies and inconsistencies. Assigning DBpedia classes to Wikipedia categories and lists can alleviate this problem and yields a large knowledge graph, which is essential for categorizing digital content through entity linking and typing. However, the existing approach, CaLiGraph, produces mappings that are incomplete and not fine-grained. In this paper, we treat the problem as ontology alignment: structural information from the knowledge graphs and lexical and semantic features of ontology class names are used to discover confident mappings, which are in turn used to fine-tune pretrained language models in a distant supervision fashion. Our method, SLHCat, consists of two main parts: 1) automatically generating training data by leveraging knowledge graph structure, semantic similarities, and named entity typing; and 2) fine-tuning and prompt-tuning the pretrained language model BERT on this training data to capture semantic and syntactic properties of class names. We evaluate SLHCat on a benchmark dataset constructed by annotating 3000 fine-grained CaLiGraph-DBpedia mapping pairs. SLHCat outperforms the baseline model by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
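To make the distant supervision idea concrete, the following is a minimal sketch of how confident category-to-class pairs might be selected as training data. It is not the SLHCat implementation: a simple token-overlap score stands in for the lexical and semantic features, the knowledge graph structure and named entity typing signals are omitted, and the class list, category names, function names, and threshold are all hypothetical.

```python
# Toy stand-in for DBpedia ontology class labels (hypothetical subset).
DBPEDIA_CLASSES = ["Person", "Scientist", "Football Player", "City", "Album"]

# Toy stand-in for Wikipedia category and list names (hypothetical).
CATEGORIES = [
    "Lists of American football players",
    "German physicists",
    "Albums by artist",
]


def normalize(token):
    """Lowercase a token and naively strip a plural 's' (assumption: English)."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def lexical_score(category, class_label):
    """Fraction of the class label's tokens that also occur in the category name."""
    cat_tokens = {normalize(t) for t in category.split()}
    cls_tokens = {normalize(t) for t in class_label.split()}
    return len(cat_tokens & cls_tokens) / len(cls_tokens)


def select_confident_mappings(categories, classes, threshold=0.8):
    """Keep only category -> class pairs whose best score reaches the threshold.

    Ties are broken in favor of the class label with more tokens, as a crude
    proxy for preferring the more specific class.
    """
    mappings = {}
    for category in categories:
        best = max(
            classes,
            key=lambda c: (lexical_score(category, c), len(c.split())),
        )
        if lexical_score(category, best) >= threshold:
            mappings[category] = best
    return mappings


# Confident pairs would serve as distantly supervised training examples.
pairs = select_confident_mappings(CATEGORIES, DBPEDIA_CLASSES)
print(pairs)
```

In this sketch, categories with no confident match (here "German physicists") are left unmapped. Under the approach described above, such cases would be handled by the language model fine-tuned on the confident pairs.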