Applications of large language models often involve generating free-form responses, which makes uncertainty quantification challenging. The difficulty lies in identifying task-specific uncertainties (e.g., about the semantics), which appear hard to define in general. This work addresses these challenges from the perspective of Bayesian decision theory, starting from the assumption that our utility is characterized by a similarity measure that compares a generated response with a hypothetical true response. We discuss how this assumption enables principled quantification of the model's subjective uncertainty and its calibration. We further derive a measure of epistemic uncertainty, based on a missing-data perspective and its characterization as an excess risk. The proposed measures can be applied to black-box language models. We demonstrate the proposed methods on question answering and machine translation tasks, where they extract broadly meaningful uncertainty estimates from GPT and Gemini models and quantify their calibration.
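To make the decision-theoretic reading concrete, the following is a minimal sketch, not the estimator used in this work. It assumes that responses sampled from a black-box model stand in for the model's predictive distribution, that utility is a similarity score in [0, 1], and that subjective uncertainty can be read as one minus the highest expected similarity any candidate achieves against the other samples. The names `token_f1` and `subjective_uncertainty`, and the choice of token-overlap F1 as the similarity, are illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple


def token_f1(a: str, b: str) -> float:
    """Hypothetical similarity measure: token-overlap F1 in [0, 1]."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        # Two empty strings count as identical; otherwise no overlap.
        return float(ta == tb)
    common = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)


def subjective_uncertainty(
    samples: Sequence[str],
    similarity: Callable[[str, str], float] = token_f1,
) -> Tuple[str, float]:
    """Return (best candidate, 1 - its expected similarity).

    Assumption: `samples` are i.i.d. draws from the model for one prompt.
    They serve both as candidate answers and as Monte Carlo stand-ins for
    the hypothetical true response (each candidate is also scored against
    itself, a simplification of this sketch).
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    best, best_utility = samples[0], -1.0
    for candidate in samples:
        # Monte Carlo estimate of the candidate's expected utility.
        utility = sum(similarity(candidate, y) for y in samples) / len(samples)
        if utility > best_utility:
            best, best_utility = candidate, utility
    return best, 1.0 - best_utility


if __name__ == "__main__":
    print(subjective_uncertainty(["Paris", "Paris", "paris"]))
    print(subjective_uncertainty(["Paris", "Lyon", "Rome"]))
```

Under these assumptions, samples that agree with each other yield an uncertainty near zero, while mutually dissimilar samples push it toward one. Swapping in a semantic similarity measure suited to the task would change what kind of disagreement counts as uncertainty.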