The tendency of large language models (LLMs) to hallucinate and to be overconfident in their predictions raises concerns about their reliability. Confidence or uncertainty estimates, which indicate how far a model's response can be trusted, are essential to developing reliable AI systems. Current research focuses primarily on LLM confidence estimation in English, leaving a gap for other widely used languages and impeding the global development of reliable AI applications. This paper presents a comprehensive investigation of multilingual confidence estimation (MlingConf) on LLMs. First, we introduce a carefully constructed, expert-checked multilingual QA dataset. Second, we evaluate the performance of confidence estimation methods and examine how the resulting confidence scores can improve LLM performance through self-refinement across diverse languages. Finally, we propose a cross-lingual confidence estimation method that yields more precise confidence scores. Experimental results characterize the performance of various confidence estimation methods across languages and show that the proposed cross-lingual technique significantly improves confidence estimation and outperforms several baseline methods.