Prompting is now a dominant method for evaluating the linguistic knowledge of large language models (LLMs). While other methods directly read out models' probability distributions over strings, prompting requires models to access this internal information by processing linguistic input, thereby implicitly testing a new type of emergent ability: metalinguistic judgment. In this study, we compare metalinguistic prompting and direct probability measurements as ways of measuring models' linguistic knowledge. Broadly, we find that LLMs' metalinguistic judgments are inferior to quantities directly derived from representations. Furthermore, consistency gets worse as the prompt query diverges from direct measurements of next-word probabilities. Our findings suggest that negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic generalization. Our results also highlight the value that is lost with the move to closed APIs where access to probability distributions is limited.