Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies have validated that LLMs perform well on general knowledge while performing poorly on long-tail, nuanced knowledge, the community remains doubtful about whether traditional knowledge graphs should be replaced by LLMs. In this paper, we ask whether the schema of the knowledge graph (i.e., the taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are familiar to people. Unfortunately, there is no comprehensive benchmark that evaluates LLMs over a wide range of taxonomies, from common to specialized domains and at levels from root to leaf, so no confident conclusion can yet be drawn. To narrow this research gap, we construct a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains, with in-depth experiments on entities at different levels of each taxonomy, from root to leaf. Our comprehensive experiments on eighteen state-of-the-art LLMs under three prompting settings validate that LLMs still cannot capture the knowledge of specialized taxonomies and leaf-level entities well.