General-domain knowledge bases (KBs), in particular the "big three" -- Wikidata, Yago and DBpedia -- are the backbone of many intelligent applications. While these three have seen steady development, comprehensive KB construction at large has seen few fresh attempts. In this work, we propose to build a large general-domain KB entirely from a large language model (LLM). We demonstrate the feasibility of large-scale KB construction from LLMs, while highlighting specific challenges that arise around entity recognition, entity and property canonicalization, and taxonomy construction. As a prototype, we use GPT-4o-mini to construct GPTKB, which contains 105 million triples for more than 2.9 million entities, at a cost 100x lower than that of previous KBC projects. Our work is a landmark for two fields: For NLP, for the first time, it provides \textit{constructive} insights into the knowledge (or beliefs) of LLMs. For the Semantic Web, it shows novel ways forward for the long-standing challenge of general-domain KB construction. GPTKB is accessible at http://gptkb.org.