Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Empowering large language models (LLMs) to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or on model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This creates a growing need to explore black-box approaches to LLM uncertainty estimation, an area that remains largely untapped. To break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks, confidence calibration and failure prediction, across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely used LLMs, including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies, can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperforms the others, and all investigated methods struggle on challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs.
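
To make the framework more concrete, the following is a minimal sketch of how sampled answers and verbalized confidences might be aggregated and then scored on the two tasks. It is not the authors' implementation. All function names are hypothetical, the two aggregation rules are simple illustrative variants, and expected calibration error (ECE) is assumed here as the calibration metric because it is a common choice. AUROC is the failure prediction metric the abstract itself reports.

```python
from collections import Counter, defaultdict


def consistency_confidence(answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def weighted_confidence(answers, verbalized):
    """Weight each sampled answer by its verbalized confidence in [0, 1].

    Illustrative aggregation rule (an assumption, not taken from the paper).
    """
    totals = defaultdict(float)
    for answer, conf in zip(answers, verbalized):
        totals[answer] += conf
    total = sum(totals.values())
    if total == 0:
        # No usable verbalized confidence: fall back to plain agreement.
        return consistency_confidence(answers)
    best = max(totals, key=totals.get)
    return best, totals[best] / total


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence|,
    weighting each bin by its share of the predictions."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += len(items) / n * abs(accuracy - mean_conf)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer receives higher confidence than an
    incorrect one, counting ties as one half."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch, a low ECE means the stated confidence tracks actual accuracy, while an AUROC near 0.5 means the confidence scores barely separate correct answers from incorrect ones. Overconfidence shows up as mean confidence well above accuracy within a bin.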
