Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.