Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings, where the central question is how likely the model is to solve a query at all. We show that this mismatch stems from the stochastic nature of modern LLM decoding, under which the correctness of a single response fails to reflect the model's underlying capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query rather than the correctness of any single sampled response. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We further establish an empirical evaluation setup and study a range of confidence estimation methods under it. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, laying a foundation for a broad range of downstream applications.
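To make the contrast concrete, the following sketch formalizes the two calibration targets under a simple sampling model. The notation is ours for illustration only ($q$ a query, $r$ a sampled response, $c(q,r)$ a binary grader, $\hat{p}$ and $\hat{m}$ hypothetical response- and query-level confidence estimators) and is not taken from the paper; the pass@$k$ identity assumes responses are drawn i.i.d. from the decoding distribution.

\begin{align*}
  % Response calibration: the confidence attached to ONE sampled response
  % should match that response's probability of being correct.
  &\Pr\bigl[c(q,r)=1 \,\big|\, \hat{p}(q,r)=p\bigr] = p
    \quad \text{for all } p \in [0,1], \\[4pt]
  % Capability calibration: the confidence attached to the QUERY should match
  % the model's expected accuracy on it,
  &\mu(q) \;=\; \mathbb{E}_{r \sim \pi(\cdot\mid q)}\bigl[c(q,r)\bigr],
    \qquad
    \mathbb{E}\bigl[\mu(q) \,\big|\, \hat{m}(q)=m\bigr] = m. \\[4pt]
  % Under i.i.d. sampling, \mu(q) determines pass@k directly, which is why a
  % capability-calibrated estimate supports pass@k prediction:
  &\mathrm{pass@}k(q) \;=\; 1 - \bigl(1 - \mu(q)\bigr)^{k}.
\end{align*}

Under the same i.i.d. assumption, a calibrated $\hat{m}(q)$ also yields a simple budget rule: the smallest $k$ with $1-(1-\hat{m}(q))^{k}\ge\tau$ is the number of samples needed to reach a target solve rate $\tau$, so harder queries (smaller $\hat{m}(q)$) receive more inference budget.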