One important approach to improving the reliability of large language models (LLMs) is to provide accurate confidence estimates for the correctness of their answers. However, developing a well-calibrated confidence estimation model is challenging, because mistakes made by LLMs can be difficult to detect. We propose a novel method that combines the LLM's self-consistency with labeled data and trains an auxiliary model to estimate the correctness of its responses to questions. This auxiliary model predicts the correctness of a response based solely on its consistency information. To set up the learning problem, we use a weighted graph to represent the consistency among the LLM's multiple responses to a question, and assign correctness labels to these responses based on their similarity to the correct answer. We then train a graph neural network to estimate the probability that a response is correct. Experiments demonstrate that the proposed approach substantially outperforms several recent confidence-calibration methods across multiple widely adopted benchmark datasets. Furthermore, the proposed approach significantly improves the generalization of confidence calibration to out-of-domain (OOD) data.
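The pipeline described above can be illustrated with a minimal sketch: sample several responses to one question, connect them in a weighted graph whose edge weights are pairwise consistency scores, and score each node with one round of normalized message passing. All names here are illustrative assumptions; in particular, the toy Jaccard similarity and the single hand-written propagation step stand in for the learned semantic similarity and the trained graph neural network of the actual method.

```python
import math

def similarity(a: str, b: str) -> float:
    """Toy lexical consistency (Jaccard over token sets); the actual method
    would use a learned or semantic similarity between responses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def consistency_graph(responses):
    """Weighted adjacency matrix: entry (i, j) holds the pairwise
    consistency between response i and response j (zero diagonal)."""
    n = len(responses)
    return [[similarity(responses[i], responses[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def confidence_scores(W):
    """One degree-normalized message-passing step with self-loops
    (GCN-style), standing in for the trained graph neural network.
    Returns a confidence in (0, 1) for each response."""
    n = len(W)
    # Initial node feature: total consistency mass of each response.
    h = [sum(row) for row in W]
    out = []
    for i in range(n):
        deg = 1.0 + sum(W[i])  # self-loop plus weighted degree
        agg = (h[i] + sum(W[i][j] * h[j] for j in range(n))) / deg
        out.append(agg)
    # Center and squash so scores are comparable across questions.
    mean = sum(out) / n
    return [1.0 / (1.0 + math.exp(-(x - mean))) for x in out]

# Three sampled responses: two mutually consistent, one outlier.
responses = [
    "the capital of australia is canberra",
    "canberra is the capital of australia",
    "the capital of australia is sydney",
]
scores = confidence_scores(consistency_graph(responses))
```

In this sketch the two mutually consistent responses receive higher confidence than the outlier, which mirrors the intuition that consistency among sampled responses is evidence of correctness.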