Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. However, generating output sequences is computationally expensive, making these methods impractical at scale. In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure. To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding. This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.
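The core computation behind G-NLL can be sketched in a few lines: greedily decode a single sequence and accumulate the negative log-probability of each chosen token. The toy `next_token_logits` function below is a hypothetical stand-in for a language model (a real implementation would query an LLM's logits); only the greedy-decode-and-sum structure reflects the measure described above.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_logits(prefix):
    # Hypothetical stand-in for an LLM: maps a token prefix to logits
    # over a 3-token vocabulary (token 2 acts as end-of-sequence).
    table = {
        (): [2.0, 0.5, -1.0],
        (0,): [0.1, 1.5, 0.3],
        (0, 1): [-1.0, -1.0, 3.0],
    }
    return table[tuple(prefix)]

def g_nll(eos_token=2, max_len=10):
    """Greedy-decode one sequence and return it with its accumulated NLL."""
    prefix, nll = [], 0.0
    for _ in range(max_len):
        probs = softmax(next_token_logits(prefix))
        tok = max(range(len(probs)), key=probs.__getitem__)  # greedy step
        nll -= math.log(probs[tok])  # add the token's negative log-likelihood
        prefix.append(tok)
        if tok == eos_token:
            break
    return prefix, nll

sequence, score = g_nll()
print(sequence, round(score, 3))  # higher score = more uncertain
```

Because only one greedy pass is needed, the score comes essentially for free with generation, in contrast to sampling-based methods that require many output sequences.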