Large language models (LLMs) specializing in natural language generation (NLG) have recently begun to exhibit promising capabilities across a variety of domains. However, gauging the trustworthiness of LLM-generated responses remains an open challenge, and research on uncertainty quantification (UQ) for NLG is limited. Furthermore, the existing literature typically assumes white-box access to language models, an assumption that is becoming unrealistic, either because the latest LLMs are closed-source or because of computational constraints. In this work, we investigate UQ in NLG for black-box LLMs. We first differentiate uncertainty from confidence: the former refers to the "dispersion" of the potential predictions for a fixed input, and the latter refers to the confidence in a particular prediction/generation. We then propose and compare several confidence/uncertainty metrics, applying them to selective NLG, where unreliable results can either be ignored or flagged for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple metric for semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs. The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG.
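To make the uncertainty/confidence distinction concrete, the following is a minimal sketch and not the metrics proposed in this work. It assumes that several responses have already been sampled from a black-box LLM for one fixed input, and it uses a token-overlap (Jaccard) score as a crude stand-in for a semantic similarity model. All function names and the 0.5 threshold are hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def uncertainty(responses, sim=jaccard):
    """Input-level uncertainty: mean pairwise dissimilarity (dispersion) of the samples."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - sim(a, b) for a, b in pairs) / len(pairs)


def confidence(responses, index, sim=jaccard):
    """Response-level confidence: mean similarity of one sample to all the others."""
    others = [r for i, r in enumerate(responses) if i != index]
    if not others:
        return 1.0
    return sum(sim(responses[index], r) for r in others) / len(others)


def select(responses, threshold=0.5, sim=jaccard):
    """Selective generation: return the most confident sample, or None to defer."""
    if not responses:
        return None
    scores = [confidence(responses, i, sim) for i in range(len(responses))]
    best = max(range(len(responses)), key=scores.__getitem__)
    return responses[best] if scores[best] >= threshold else None


samples = ["paris", "paris", "london"]
print(uncertainty(samples))    # one score for the input as a whole
print(confidence(samples, 2))  # one score for the third sample only
print(select(samples))         # the chosen response, or None if deferred
```

Note that `uncertainty` depends only on the set of samples for the input, whereas `confidence` is attached to a single sample, so two responses to the same input can share one uncertainty value yet receive different confidence scores.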