Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into a structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.
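The recursive extraction loop behind such a subcrawl can be sketched as follows. This is a minimal illustration, not the paper's implementation: `elicit_triples` is a hypothetical stand-in for the LLM prompt that returns (subject, predicate, object) triples, and the crawl expands unseen objects into new subjects until the frontier empties (termination) or a budget is exhausted.

```python
from collections import deque

def elicit_triples(subject):
    """Hypothetical stand-in for an LLM call that elicits triples for a
    subject; a real system would prompt the model here."""
    toy_kb = {
        "Vannevar Bush": [("Vannevar Bush", "field", "engineering"),
                          ("Vannevar Bush", "knownFor", "Memex")],
        "Memex": [("Memex", "type", "hypothetical device")],
    }
    return toy_kb.get(subject, [])

def materialize(seed, max_subjects=100):
    """Recursive subject-expansion crawl: elicit triples for a subject,
    then enqueue every unseen object as a new subject, until the
    frontier is empty or the subject budget is hit."""
    seen, triples = {seed}, []
    frontier = deque([seed])
    while frontier and len(seen) <= max_subjects:
        subject = frontier.popleft()
        for s, p, o in elicit_triples(subject):
            triples.append((s, p, o))
            if o not in seen:        # objects become new crawl subjects
                seen.add(o)
                frontier.append(o)
    return triples
```

Whether this loop terminates depends on the model's answers: if elicited objects keep introducing novel entities, the frontier never empties, which is why the paper treats termination as an empirical question.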