Prompting and Multiple-Choice Questions (MCQ) have become the preferred approaches to assess the capabilities of Large Language Models (LLMs), due to their ease of manipulation and evaluation. Such experimental appraisals have pointed toward the LLMs' apparent ability to perform causal reasoning or to grasp uncertainty. In this paper, we investigate whether these abilities are measurable outside of tailored prompting and MCQ by reformulating these issues as direct text completion, the foundational task of LLMs. To achieve this goal, we define scenarios with multiple possible outcomes, and we compare the predictions that LLMs make through prompting (their Stated Answer) to the probability distributions they compute over these outcomes during next-token prediction (their Revealed Belief). Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer, and hint at multiple biases and misrepresentations in their beliefs across many scenarios and outcomes. As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture and that more research is needed to assess the extent and nature of LLM capabilities.
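To make the Stated Answer / Revealed Belief contrast concrete, the following is a minimal sketch (not the authors' code) of how a Revealed Belief could be read off a causal LM: each candidate outcome is scored as a continuation of the scenario and the scores are renormalized into a distribution. The model name, scenario, and outcome strings are illustrative assumptions, not taken from the paper.

```python
# Sketch: extracting a "Revealed Belief" distribution over candidate outcomes
# from a causal language model, assuming the Hugging Face transformers API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical scenario with multiple possible outcomes.
scenario = "The sky darkened and thunder rolled, so the picnic was"
outcomes = [" cancelled", " moved indoors", " held as planned"]

def completion_logprob(prefix: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion` after `prefix`.

    Uses the tokenized prefix length as the boundary, a common approximation
    that ignores possible tokenization effects at the prefix/completion seam.
    """
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Each completion token is predicted at the position just before it.
    for pos in range(prefix_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

scores = torch.tensor([completion_logprob(scenario, o) for o in outcomes])
revealed_belief = torch.softmax(scores, dim=0)  # distribution over outcomes
for outcome, p in zip(outcomes, revealed_belief):
    print(f"{outcome!r}: {p:.3f}")
```

The Stated Answer, by contrast, would be obtained by prompting the same model to pick one outcome (e.g., via an MCQ template) and parsing its generated text; the paper's comparison is between that choice and a distribution like the one computed above.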