Although large language models (LLMs) perform impressively on many tasks, overconfidence remains a problem. We hypothesize that on multiple-choice Q&A tasks, wrong answers are associated with smaller maximum softmax probabilities (MSPs) than correct answers. We comprehensively evaluate this hypothesis on ten open-source LLMs and five datasets, and find strong evidence for it among models that perform well on the original Q&A task. For the six LLMs with the best Q&A performance, the AUROC derived from the MSP is better than random chance with p < 10^{-4} in 59/60 instances. Among those six LLMs, the average AUROC ranges from 60% to 69%. Leveraging these findings, we propose a multiple-choice Q&A task with an option to abstain and show that performance can be improved by selectively abstaining based on the MSP of the initial model response. We also run the same experiments with pre-softmax logits instead of softmax probabilities and find similar (but not identical) results.
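
To make the quantities above concrete, the following is a minimal sketch, not the authors' implementation. It computes the MSP from per-option logits, estimates the AUROC of the MSP as a predictor of correctness, and scores a simple abstention rule. The function names, the toy logits, and the scoring scheme (+1 for a correct answer, a penalty for a wrong answer, 0 for abstaining) are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def max_softmax_prob(logits):
    """Return the maximum softmax probability (MSP) for each row of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def auroc(scores, is_correct):
    """Probability that a random correct answer outscores a random wrong one.

    Ties count as 1/2. This pairwise form is O(n_pos * n_neg), which is
    fine for a sketch but not for large evaluations.
    """
    scores = np.asarray(scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    pos, neg = scores[is_correct], scores[~is_correct]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one correct and one wrong answer")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def score_with_abstention(msp, is_correct, threshold, wrong_penalty=1.0):
    """Hypothetical scoring: +1 if correct, -wrong_penalty if wrong, 0 if abstaining.

    The model abstains whenever its MSP falls below `threshold`.
    """
    msp = np.asarray(msp, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    answered = msp >= threshold
    gained = np.sum(answered & is_correct)
    lost = wrong_penalty * np.sum(answered & ~is_correct)
    return float(gained - lost)

# Toy data (made up): four questions, four options each.
toy_logits = np.array([
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],   # low-confidence prediction
    [0.0, 0.0, 2.0, 0.0],
])
toy_answers = np.array([0, 1, 1, 2])            # ground-truth option indices
toy_correct = toy_logits.argmax(axis=1) == toy_answers
toy_msp = max_softmax_prob(toy_logits)

print("AUROC:", auroc(toy_msp, toy_correct))
print("score, never abstain:", score_with_abstention(toy_msp, toy_correct, 0.0))
print("score, abstain below 0.5:", score_with_abstention(toy_msp, toy_correct, 0.5))
```

In this toy example the only wrong answer also has the lowest MSP, so abstaining below the threshold removes the penalty without giving up any correct answers. For the pre-softmax variant mentioned above, one could pass `toy_logits.max(axis=1)` to `auroc` in place of the MSP; whether that matches the authors' exact setup is an assumption.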