Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities across a variety of domains. However, gauging the trustworthiness of responses generated by LLMs remains an open challenge, with limited research on uncertainty quantification (UQ) for NLG. Furthermore, existing literature typically assumes white-box access to language models, which is becoming unrealistic due either to the closed-source nature of the latest LLMs or to computational constraints. In this work, we investigate UQ in NLG for *black-box* LLMs. We first differentiate *uncertainty* from *confidence*: the former refers to the "dispersion" of the potential predictions for a fixed input, and the latter refers to the confidence in a particular prediction/generation. We then propose and compare several confidence/uncertainty measures, applying them to *selective NLG*, where unreliable results can either be ignored or passed on for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure of semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs. The code to replicate our experiments is available at https://github.com/zlin7/UQ-NLG.
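The distinction between uncertainty (a property of the input) and confidence (a property of one particular response) can be illustrated with a minimal sketch. It is not the measure used in the paper: it assumes that several responses have already been sampled from a black-box LLM for the same prompt, and it uses token-level Jaccard overlap as a crude stand-in for semantic similarity. The function names `jaccard`, `dispersion`, `confidence`, and `selective_answer`, as well as the threshold `max_uncertainty`, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed proxy for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty responses are treated as identical
    return len(ta & tb) / len(ta | tb)


def dispersion(responses: list[str]) -> float:
    """Uncertainty for a fixed input: 1 minus the mean pairwise similarity."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0  # a single sample carries no dispersion information
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def confidence(responses: list[str], i: int) -> float:
    """Confidence in response i: its mean similarity to the other samples."""
    others = [r for j, r in enumerate(responses) if j != i]
    if not others:
        return 1.0
    return sum(jaccard(responses[i], r) for r in others) / len(others)


def selective_answer(responses: list[str], max_uncertainty: float = 0.5):
    """Selective NLG: abstain (return None) when the samples disagree too much,
    otherwise return the response with the highest confidence."""
    if dispersion(responses) > max_uncertainty:
        return None
    best = max(range(len(responses)), key=lambda i: confidence(responses, i))
    return responses[best]
```

With this sketch, three samples that agree (for example `["paris", "paris", "Paris"]`) give zero dispersion and an answer is returned, whereas three samples that share no tokens give maximal dispersion and the function abstains. A real pipeline would likely replace `jaccard` with a model-based semantic similarity and tune the threshold on held-out data.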