Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
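The abstract does not spell out SCA's exact procedure, but the core idea — sample several high-probability responses, group semantically equivalent ones, and aggregate their probability mass into a single confidence estimate — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`aggregate_confidence`, `normalize`), the `top_k` parameter, and the use of surface-form normalization as a stand-in for true semantic equivalence (which in practice would use an NLI model or embeddings) are all assumptions.

```python
from collections import defaultdict

def normalize(text):
    # Crude stand-in for semantic equivalence checking: lowercase and
    # strip punctuation, so surface variants of one answer collapse.
    return "".join(ch for ch in text.lower()
                   if ch.isalnum() or ch.isspace()).strip()

def aggregate_confidence(samples, top_k=3):
    """Sketch of SCA-style aggregation (hypothetical interface).

    samples: list of (response_text, probability) pairs drawn from the LLM.
    Groups semantically equivalent responses into clusters, then sums the
    probability mass of the top_k clusters -- so a question with several
    valid answers is not penalized for its samples disagreeing on which
    correct answer to give.
    """
    clusters = defaultdict(float)
    for text, prob in samples:
        clusters[normalize(text)] += prob
    top = sorted(clusters.values(), reverse=True)[:top_k]
    return min(1.0, sum(top))

# Toy example: a question with two valid answers ("Paris" dominates,
# with surface variants; "Lyon" is a second plausible response).
samples = [
    ("Paris", 0.35), ("paris.", 0.25),
    ("Lyon", 0.20), ("Marseille", 0.10),
]
conf = aggregate_confidence(samples, top_k=2)
# The two Paris variants merge into one 0.60 cluster; with top_k=2 the
# aggregated confidence is 0.60 + 0.20 = 0.80.
```

A plain max-probability baseline on the raw samples would report 0.35 here, illustrating the underestimation the abstract describes: disagreement among equally acceptable responses fragments the probability mass that aggregation recovers.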