LLMs have substantially advanced NLP and AI, and beyond their ability to perform a wide range of procedural tasks, a major factor in their success is their internalized factual knowledge. Since Petroni et al. (2019), analyzing this knowledge has gained attention. However, most approaches investigate one question at a time via modest-sized pre-defined samples, introducing an availability bias (Tversky and Kahneman, 1973) that prevents the discovery of knowledge (or beliefs) of LLMs beyond the experimenter's predisposition. To address this challenge, we propose a novel methodology for comprehensively materializing an LLM's factual knowledge through recursive querying and result consolidation. As a prototype, we employ GPT-4o-mini to construct GPTKB, a large-scale knowledge base (KB) comprising 105 million triples for over 2.9 million entities, achieved at 1% of the cost of previous KB projects. This work marks a milestone in two areas: for LLM research, it provides, for the first time, constructive insights into the scope and structure of LLMs' knowledge (or beliefs); for KB construction, it pioneers new pathways for the long-standing challenge of general-domain KB construction. GPTKB is accessible at https://gptkb.org.
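The recursive querying and result consolidation mentioned above can be pictured as a breadth-first expansion: each object of an elicited triple is queued as a new subject to query, and all triples are consolidated into one set. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation; `elicit_triples` is a hypothetical stand-in for an actual LLM call, backed here by a toy dictionary.

```python
from collections import deque

def elicit_triples(entity: str) -> list[tuple[str, str, str]]:
    """Hypothetical stand-in for an LLM call that returns
    (subject, predicate, object) triples about `entity`."""
    toy_responses = {
        "Douglas Adams": [
            ("Douglas Adams", "notable work", "The Hitchhiker's Guide to the Galaxy"),
            ("Douglas Adams", "birthplace", "Cambridge"),
        ],
        "Cambridge": [("Cambridge", "country", "United Kingdom")],
    }
    return toy_responses.get(entity, [])

def materialize(seed: str, max_entities: int = 1000) -> set[tuple[str, str, str]]:
    """Breadth-first recursive elicitation: every newly seen object
    becomes a subject to query; results are consolidated in one set."""
    triples: set[tuple[str, str, str]] = set()
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(seen) <= max_entities:
        entity = frontier.popleft()
        for s, p, o in elicit_triples(entity):
            triples.add((s, p, o))
            if o not in seen:       # enqueue unseen objects for recursion
                seen.add(o)
                frontier.append(o)
    return triples
```

A production version would additionally need duplicate-entity canonicalization and predicate normalization during consolidation, which the set-based sketch only hints at.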