Symbolic knowledge graphs (KGs) play a pivotal role in knowledge-centric applications such as search, question answering, and recommendation. As contemporary language models (LMs) trained on extensive textual data have gained prominence, researchers have extensively explored whether the parametric knowledge within these models can match the knowledge present in knowledge graphs. Various approaches have indicated that increasing the size of the model or the volume of training data improves its capacity to retrieve symbolic knowledge, often with minimal or no human supervision. Despite these advancements, no comprehensive evaluation has examined whether LMs can capture the intricate topological and semantic attributes of KGs, attributes that are crucial for reasoning. In this work, we provide an exhaustive evaluation of language models of varying sizes and capabilities. We construct nine qualitative benchmarks that cover a spectrum of attributes: symmetry, asymmetry, hierarchy, bidirectionality, compositionality, paths, entity-centricity, bias, and ambiguity. Additionally, we propose novel evaluation metrics tailored to each of these attributes. Our evaluation of various LMs shows that while these models exhibit considerable potential in recalling factual information, their ability to capture intricate topological and semantic traits of KGs remains significantly constrained. We also find that our proposed metrics are more reliable in evaluating these abilities than existing metrics. Lastly, results on some of our benchmarks challenge the common notion that larger LMs (e.g., GPT-4) universally outperform their smaller counterparts (e.g., BERT).