Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks and domains. Despite this impressive performance, they can be unreliable because their generations may contain factual errors. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations. Much recent research has aimed to address this, but no comprehensive overview has yet organized that work and outlined the main lessons learned. The present survey aims to bridge this gap. In particular, we outline the challenges and summarize recent technical advances in LLM confidence estimation and calibration. We further discuss their applications and suggest promising directions for future work.