SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

Large language models (LLMs) often generate inaccurate or fabricated information and generally fail to indicate their confidence, which limits their broader applications. Previous work elicits confidence from LLMs either by direct or self-consistency prompting or by constructing specific datasets for supervised fine-tuning. The prompting-based approaches have inferior performance, and the training-based approaches are limited to binary or inaccurate group-level confidence estimates. In this work, we present SaySelf, a training framework that teaches LLMs to express more accurate, fine-grained confidence estimates. Beyond the confidence scores, SaySelf takes a first step toward directing LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge and explain their uncertainty. This is achieved by using an LLM to automatically summarize, in natural language, the uncertainties in specific knowledge. The summarization is based on an analysis of the inconsistency among multiple sampled reasoning chains, and the resulting data is used for supervised fine-tuning. Moreover, we use reinforcement learning with a carefully designed reward function to calibrate the confidence estimates, rewarding accurate, high-confidence predictions and penalizing overconfidence in erroneous outputs. Experimental results on both in-distribution and out-of-distribution datasets demonstrate the effectiveness of SaySelf in reducing the confidence calibration error while maintaining task performance. We show that the generated self-reflective rationales are reasonable and can further contribute to calibration. The code is made public at https://github.com/xu1868/SaySelf.
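
The abstract names three ingredients without giving formulas: agreement among sampled reasoning chains, a reward that favors confident correct answers and penalizes confident wrong ones, and a calibration error metric. The following is a minimal Python sketch of what such pieces might look like, not the SaySelf implementation. The function names, the linear form of the reward, and the use of majority-vote agreement as a confidence proxy are all assumptions made for illustration; only the binned expected calibration error (ECE) follows a standard definition.

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples that agree with it.

    Assumption: agreement among sampled answers is used as a rough confidence proxy.
    """
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


def calibration_reward(is_correct, confidence):
    """Hypothetical reward: +confidence when correct, -confidence when wrong.

    Assumption: the abstract does not state the reward's form; this linear
    version only illustrates the stated intent.
    """
    return confidence if is_correct else -confidence


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: size-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece
```

Under this sketch, a model that answers with confidence 1.0 and is right only half the time gets an ECE of 0.5, while the same wrong answer given with confidence 0.2 would incur a much smaller penalty from the hypothetical reward than one given with confidence 0.9.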
