Large language models (LLMs) specializing in natural language generation (NLG) have recently shown promising capabilities across a variety of domains. However, gauging the trustworthiness of LLM-generated responses remains an open challenge, and research on uncertainty quantification for NLG is limited. Furthermore, existing literature typically assumes white-box access to language models, an assumption that is becoming unrealistic, either because the latest LLMs are closed-source or because of computational constraints. In this work, we investigate uncertainty quantification in NLG for $\textit{black-box}$ LLMs. We first differentiate two closely related notions: $\textit{uncertainty}$, which depends only on the input, and $\textit{confidence}$, which additionally depends on the generated response. We then propose and compare several confidence/uncertainty metrics, applying them to $\textit{selective NLG}$, where unreliable results can either be ignored or passed on for further assessment. Our findings on several popular LLMs and datasets reveal that a simple metric of average semantic dispersion can be a reliable predictor of the quality of LLM responses. This study can provide valuable insights for practitioners on uncertainty management when adopting LLMs. The code to replicate all our experiments is available at https://github.com/zlin7/UQ-NLG.
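To make the distinction between uncertainty and confidence concrete, the following is a minimal sketch of dispersion-based scoring and selective generation in a black-box setting, where only sampled responses are available. It is not the authors' implementation (see the linked repository for that). The function names, the threshold `max_uncertainty`, and the use of token-level Jaccard overlap as a stand-in for a semantic similarity measure are all assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Optional


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1].

    Assumption: a crude stand-in for a real semantic similarity model.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def uncertainty(responses: List[str]) -> float:
    """Average pairwise dissimilarity among responses sampled for one input.

    This score is a property of the input: it does not single out any response.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - jaccard(a, b) for a, b in pairs) / len(pairs)


def confidence(responses: List[str], i: int) -> float:
    """Average similarity of response i to the other sampled responses.

    This score depends on the specific response as well as the input.
    """
    others = [r for j, r in enumerate(responses) if j != i]
    if not others:
        return 1.0
    return sum(jaccard(responses[i], r) for r in others) / len(others)


def selective_answer(responses: List[str], max_uncertainty: float = 0.5) -> Optional[str]:
    """Return the most confident response, or None to abstain.

    Abstaining corresponds to ignoring the result or passing it on for review.
    The threshold value is hypothetical and would need tuning per task.
    """
    if uncertainty(responses) > max_uncertainty:
        return None
    best = max(range(len(responses)), key=lambda i: confidence(responses, i))
    return responses[best]
```

With samples such as `["paris", "paris", "paris"]` the dispersion is zero and the answer is returned, whereas three mutually disjoint samples give maximal dispersion and the function abstains. In practice the similarity function would be replaced by a model that judges whether two responses mean the same thing, since token overlap misses paraphrases.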