In this paper, we focus on the challenging task of reliably estimating the factual knowledge embedded inside large language models (LLMs). To avoid the reliability concerns of prior approaches, we propose to eliminate prompt engineering when probing LLMs for factual knowledge. Our approach, called the Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the in-context learning ability of LLMs to communicate both the factual question and the expected answer format. Our knowledge estimator is both conceptually simpler (i.e., it does not depend on meta-linguistic judgments of LLMs) and easier to apply (i.e., it is not LLM-specific), and we demonstrate that it can surface more of the latent knowledge embedded in LLMs. We also investigate how different design choices affect the performance of ZP-LKE. Using the proposed estimator, we perform a large-scale evaluation of the factual knowledge of a variety of open-source LLMs, such as OPT, Pythia, Llama(2), Mistral, and Gemma, over a large set of relations and facts from the Wikidata knowledge base. We observe differences in factual knowledge across model families and across models of different sizes; we find that some relations are consistently better known than others, but that models differ in the precise facts they know; and we find differences between the knowledge of base models and their fine-tuned counterparts. Code available at: https://github.com/QinyuanWu0710/ZeroPrompt_LKE
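To make the idea concrete, the following is a minimal sketch of how a zero-prompt in-context probe for a single relation might be constructed: a sequence of (subject, object) demonstration pairs communicates the relation and the answer format, and the model is asked to complete the object for a held-out subject. The helper name `build_probe` and the separator format are illustrative assumptions, not the paper's exact implementation (see the repository above for that).

```python
def build_probe(examples, query_subject, sep=" -> "):
    """Build an in-context probe from (subject, object) pairs for one relation.

    The demonstrations alone convey both the relation being queried and the
    expected answer format; no natural-language instruction is used. The
    separator is an illustrative choice, not the paper's exact format.
    """
    lines = [f"{s}{sep}{o}" for s, o in examples]
    # The model is expected to complete the object for the final subject.
    lines.append(f"{query_subject}{sep}")
    return "\n".join(lines)


# Example: probing a capital-of relation with Wikidata-style facts.
prompt = build_probe(
    [("France", "Paris"), ("Italy", "Rome"), ("Japan", "Tokyo")],
    "Germany",
)
print(prompt)
```

The resulting string would be fed to the LLM as-is; a model that knows the fact should continue the last line with the correct object (here, "Berlin"), which is then compared against the Wikidata reference answer.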