Benchmarks such as MMLU suggest that flagship language models are approaching factuality saturation, with scores above 90\%. We show this picture is incomplete. \emph{LLMpedia} generates encyclopedic articles entirely from parametric memory, producing ${\sim}$1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7\%, more than 15 percentage points below the benchmark-based picture and consistent with the availability bias of fixed-question evaluation. For frontier subjects beyond Wikipedia, which are verifiable only through curated web evidence, the true rate falls further to 63.2\%. Wikipedia covers just 61\% of surfaced subjects, and the three model families overlap by only 7.3\% in subject choice. In a capture-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia achieves substantially higher factuality at roughly half the textual similarity to Wikipedia. Unlike Grokipedia, LLMpedia publicly releases every prompt, artifact, and evaluation verdict, making it the first fully open parametric encyclopedia and bridging factuality evaluation and knowledge materialization. All data, code, and a browsable interface are available at https://llmpedia.net.