Knowledge graphs (KGs) consist of links that describe relationships between entities. Because manually enumerating all relationships between entities is difficult, completing them automatically is essential for KGs. Knowledge Graph Completion (KGC) is the task of inferring unseen relationships between entities in a KG. Traditional embedding-based KGC methods, such as RESCAL, TransE, DistMult, ComplEx, RotatE, HAKE, and HousE, infer missing links using only the knowledge in the training data. In contrast, recent Pre-trained Language Model (PLM)-based KGC methods also utilize knowledge obtained during pre-training. Consequently, PLM-based KGC may estimate missing links between entities simply by reusing knowledge memorized during pre-training, without performing any inference. This is problematic because the purpose of building KGC models is to infer unseen links between entities. However, conventional KGC evaluations do not consider inference and memorization abilities separately. Thus, a PLM-based KGC method that achieves high performance in current KGC evaluations may be ineffective in practical applications. To address this issue, we analyze whether PLM-based KGC methods make inferences or merely access memorized knowledge. For this purpose, we propose a method for constructing synthetic datasets specialized for this analysis, and we conclude that PLMs acquire the inference abilities required for KGC through pre-training, even though the performance improvements mostly come from the textual information of entities and relations.