Large Language Models (LLMs) have been transformative across many domains. However, hallucination, i.e., confidently outputting incorrect information, remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify their uncertainty. An extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and offers no comprehensive benchmark for insightful comparison among existing solutions. In this work, we fill this gap via a systematic survey of representative prior work on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods; the results corroborate the key findings of our review. Finally, we outline key future directions and open challenges. To the best of our knowledge, this is the first dedicated survey of calibration methods and relevant metrics for LLMs.
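As a concrete reference point for the calibration notion above (a standard definition added here for illustration, not taken from the paper itself), the misalignment between uncertainty and accuracy is commonly quantified by the Expected Calibration Error (ECE), which partitions n predictions into M confidence bins B_m and averages the per-bin gap between empirical accuracy and mean confidence:

\[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| \]

A perfectly calibrated model attains ECE = 0: among predictions made with confidence p, a fraction p are correct.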