Large Language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model's uncertainty about its outputs, indicating the likelihood that those outputs are problematic. For LLM reasoning tasks, it is essential to estimate the uncertainty not only of the final answer but also of the intermediate reasoning steps, as this enables more fine-grained and targeted interventions. In this study, we explore which UQ metrics better reflect an LLM's ``intermediate uncertainty'' during reasoning. Our study reveals that an LLM's incorrect reasoning steps tend to contain tokens that are highly sensitive to perturbations of the preceding token embeddings. Consequently, incorrect (uncertain) intermediate steps can be readily identified in practice using this sensitivity score as guidance. In our experiments, we show that this perturbation-based metric achieves stronger uncertainty quantification performance than baseline methods such as token (generation) probability and token entropy. Moreover, unlike approaches that rely on multiple samplings, perturbation-based metrics are simpler and more efficient.
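To make the sensitivity idea concrete, the following is a minimal toy sketch, not the paper's exact formulation: a token's sensitivity is measured as the average change in its generation probability when the preceding hidden/embedding vector is perturbed with small Gaussian noise. Here `W` stands in for a model's output projection head, and `sensitivity_score`, `sigma`, and `n_samples` are illustrative names and hyperparameters chosen for this sketch.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sensitivity_score(W, h, token_id, sigma=0.01, n_samples=8, seed=0):
    """Toy perturbation-based sensitivity for one generated token.

    W        : (vocab_size, hidden_dim) output projection (stand-in for an LLM head)
    h        : (hidden_dim,) preceding token's embedding/hidden state
    token_id : index of the token whose stability we probe
    Returns the mean absolute change in P(token_id) under Gaussian
    perturbations of h; larger values suggest a less stable (more
    uncertain) step under this sketch's assumptions.
    """
    rng = np.random.default_rng(seed)
    p0 = softmax(W @ h)[token_id]  # unperturbed token probability
    deltas = []
    for _ in range(n_samples):
        h_pert = h + sigma * rng.standard_normal(h.shape)
        deltas.append(abs(softmax(W @ h_pert)[token_id] - p0))
    return float(np.mean(deltas))
```

In practice one would score each token of a reasoning step this way (aggregating, e.g., by max or mean over the step) and flag steps whose scores exceed a threshold; note this requires only forward passes with perturbed inputs, not repeated sampling of full generations.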