Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, under a sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory rests on a general definition of "B-calibration," a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This mechanism yields a testable prediction: base LLMs will be semantically calibrated whenever they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction and validate them experimentally: (1) base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction tuning systematically breaks this calibration, and (3) chain-of-thought reasoning likewise breaks it. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
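To make the sampling-based notion concrete, the sketch below shows one plausible way to estimate a model's semantic confidence and check calibration; it is a minimal illustration under stated assumptions, not the paper's implementation. The names `sample_fn`, `equiv_key`, and `n_samples` are hypothetical: `sample_fn` stands in for drawing a response from the LLM, and `equiv_key` stands in for the equivalence relation defining the "B" classes (in practice this might be answer normalization or an entailment model; the paper's actual protocol may differ).

```python
from collections import Counter
from typing import Callable, List, Tuple

def semantic_confidence(prompt: str,
                        sample_fn: Callable[[str], str],   # hypothetical: samples one LLM response
                        equiv_key: Callable[[str], str],   # hypothetical: maps a response to its semantic class
                        n_samples: int = 20) -> Tuple[str, float]:
    """Sample n responses, group them into semantic equivalence classes,
    and return a representative of the modal class together with that
    class's empirical frequency (the 'semantic confidence')."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    class_counts = Counter(equiv_key(a) for a in answers)
    top_key, top_count = class_counts.most_common(1)[0]
    representative = next(a for a in answers if equiv_key(a) == top_key)
    return representative, top_count / n_samples

def expected_calibration_error(records: List[Tuple[float, bool]],
                               n_bins: int = 10) -> float:
    """Standard binned ECE over (confidence, was_correct) pairs:
    a weighted average of |mean confidence - accuracy| per bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[idx].append((conf, correct))
    total = len(records)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece
```

Under this notion, a model is semantically calibrated if, among questions where the modal class's empirical frequency is p, the modal answer is correct roughly a p fraction of the time, i.e. the ECE over a question set is near zero.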