Language Models (LMs) have shown promising performance in natural language generation. Because LMs often generate incorrect or hallucinated responses, it is crucial to correctly quantify their uncertainty in responding to given inputs. In addition to verbalized confidence elicited via prompting, many uncertainty measures (e.g., semantic entropy and affinity-graph-based measures) have been proposed. However, these measures can differ greatly, and it is unclear how to compare them, partly because they take values over different ranges (e.g., $[0,\infty)$ or $[0,1]$). In this work, we address this issue by developing a novel and practical framework, termed Rank-Calibration, to assess uncertainty and confidence measures for LMs. Our key tenet is that higher uncertainty (or lower confidence) should imply lower generation quality, on average. Rank-calibration quantifies deviations from this ideal relationship in a principled manner, without requiring ad hoc binary thresholding of the correctness score (e.g., ROUGE or METEOR). The broad applicability and the granular interpretability of our methods are demonstrated empirically.
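To make the key tenet concrete, the following is a minimal illustrative sketch and not necessarily the estimator used in this work. It assumes paired arrays of uncertainty values and continuous quality scores, groups responses into equal-mass bins by uncertainty, and compares the rank of each bin's average uncertainty with the reversed rank of its average quality. The function name `rank_gap` and the parameter `n_bins` are hypothetical choices made for this illustration.

```python
import numpy as np


def rank_gap(uncertainty, quality, n_bins=10):
    """Illustrative rank-based gap (hypothetical helper, not the paper's code).

    Returns 0 when bins with higher average uncertainty always have lower
    average quality, and grows as that ordering is violated.
    """
    u = np.asarray(uncertainty, dtype=float)
    q = np.asarray(quality, dtype=float)
    if u.ndim != 1 or u.shape != q.shape or u.size < n_bins:
        raise ValueError("need two 1-D arrays of equal length >= n_bins")

    # Equal-mass bins over responses sorted by uncertainty.
    order = np.argsort(u, kind="stable")
    bins = np.array_split(order, n_bins)
    mean_u = np.array([u[b].mean() for b in bins])
    mean_q = np.array([q[b].mean() for b in bins])

    # For each bin: fraction of bins that are at most as uncertain,
    # and fraction of bins that are at least as good in quality.
    frac_u_le = (mean_u[None, :] <= mean_u[:, None]).mean(axis=1)
    frac_q_ge = (mean_q[None, :] >= mean_q[:, None]).mean(axis=1)

    # Under the ideal relationship the two fractions coincide.
    return float(np.abs(frac_u_le - frac_q_ge).mean())


if __name__ == "__main__":
    u = np.arange(100, dtype=float)
    print(rank_gap(u, -u))  # quality falls as uncertainty rises
    print(rank_gap(u, u))   # quality rises with uncertainty
```

Because the comparison uses only ranks, the sketch applies unchanged to measures valued in $[0,\infty)$ or $[0,1]$, and the quality score (e.g., ROUGE or METEOR) is used as a continuous value without thresholding.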