Handling probabilities, in the form of uncertainty or partial information, is an essential task for LLMs in many settings and applications. A common way to evaluate an LLM's probabilistic reasoning capabilities is to test its ability to answer probability questions posed as multiple-choice questions (MCQs). However, this paradigm, which we refer to as explicit probabilistic reasoning, has been shown in the literature to suffer from significant limitations (e.g., sensitivity to answer ordering). In this work, we introduce an alternative approach, named implicit probabilistic reasoning, which evaluates a model's ability to integrate probabilistic reasoning into its text generation process. To achieve this, we rephrase MCQs as text-completion scenarios with a fixed set of outcomes and compare the model's next-token probability assignments to the true likelihood of each outcome. In line with previous work, we find that models exhibit solid performance in explicit probabilistic reasoning (i.e., answering MCQs). However, during text completion (i.e., implicit probabilistic reasoning), where the same information must be taken into account to generate text, the models' predictions often diverge significantly from the known ground truth. For instance, our evaluation method reveals that implicit probabilistic reasoning is improperly influenced by many factors, such as independent prior events, partial observations about a result, or statistical background information. All of these issues can lead to erroneous text generation that conventional MCQ-based evaluation does not detect.
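To make the comparison between next-token probabilities and ground-truth likelihoods concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each outcome maps to a single next token, that the model's probabilities are renormalized over the allowed outcomes, and that the gap is measured with total variation distance; the abstract does not specify any of these choices. The function names, the coin-flip scenario, and the log-probabilities are all hypothetical and made up for illustration.

```python
import math


def outcome_distribution(token_logprobs, outcomes):
    """Renormalize next-token log-probabilities over the allowed outcome tokens.

    Assumption: each outcome corresponds to exactly one next token.
    """
    probs = {o: math.exp(token_logprobs[o]) for o in outcomes}
    total = sum(probs.values())
    return {o: p / total for o, p in probs.items()}


def total_variation(p, q):
    """Total variation distance between two distributions over the same outcomes."""
    return 0.5 * sum(abs(p[o] - q[o]) for o in q)


# Hypothetical text-completion scenario: a fair coin is flipped and the prompt
# ends with "The coin landed on", so the ground truth is 50/50.
ground_truth = {" heads": 0.5, " tails": 0.5}

# Made-up next-token log-probabilities, standing in for real model output.
# The token " its" is not a valid outcome and is dropped by renormalization.
model_logprobs = {
    " heads": math.log(0.6),
    " tails": math.log(0.2),
    " its": math.log(0.2),
}

model_dist = outcome_distribution(model_logprobs, ground_truth)
gap = total_variation(model_dist, ground_truth)
print(model_dist, gap)
```

A gap of 0 would mean the renormalized next-token distribution matches the ground truth exactly; larger values indicate stronger divergence. Other divergence measures (for example KL divergence) could be substituted without changing the overall structure.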