Instruction-tuned LLMs can respond to explicit queries formulated as prompts, which greatly facilitates interaction with human users. However, prompt-based approaches may not always be able to tap into the wealth of implicit knowledge that LLMs acquire during pre-training. This paper presents a comprehensive study of ways to evaluate semantic plausibility in LLMs. We compare base and instruction-tuned LLM performance on an English sentence plausibility task via (a) explicit prompting and (b) implicit estimation, i.e., direct readout of the probabilities models assign to strings. Experiment 1 shows that, across model architectures and plausibility datasets, (i) log likelihood ($\textit{LL}$) scores are the most reliable indicator of sentence plausibility, with zero-shot prompting yielding inconsistent and typically poor results; (ii) $\textit{LL}$-based performance is still inferior to human performance; (iii) instruction-tuned models have worse $\textit{LL}$-based performance than base models. In Experiment 2, we show that $\textit{LL}$ scores across models are modulated by context in the expected way: they achieve high performance on three metrics of context-sensitive plausibility and directly match explicit human plausibility judgments. Overall, $\textit{LL}$ estimates remain a more reliable measure of plausibility in LLMs than direct prompting.
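To make the implicit readout concrete, the following is a minimal sketch of how sentence-level $\textit{LL}$ can be computed from a causal language model with the Hugging Face `transformers` library. The checkpoint name (`gpt2`) and the example sentence pair are illustrative assumptions, not the models or stimuli used in the paper.

```python
# Minimal sketch (not the authors' exact code): estimating sentence
# plausibility by reading out the log likelihood (LL) a causal LM
# assigns to a string, the implicit alternative to explicit prompting.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Sum of the token log probabilities the model assigns to `sentence`."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so the logits at position t score the token at position t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

# A plausible sentence should receive a higher LL than an implausible one
# (hypothetical example items, in the style of plausibility minimal pairs).
plausible = "The teacher bought the laptop."
implausible = "The laptop bought the teacher."
print(sentence_log_likelihood(plausible) > sentence_log_likelihood(implausible))
```

Summing rather than averaging token log probabilities is one of several possible readout choices; length-normalized variants are also common when comparing sentences of unequal length.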