As Large Language Models (LLMs) become more widely deployed, understanding how they self-evaluate confidence in their generated responses is increasingly important, since this self-assessment bears directly on the reliability of their output. We introduce the concept of Confidence-Probability Alignment, which connects an LLM's internal confidence, quantified by token probabilities, with the confidence the model conveys in its response when explicitly asked about its certainty. Using various datasets and prompting techniques that encourage model introspection, we probe the alignment between models' internal and expressed confidence. These techniques include rating confidence on structured evaluation scales, including answer options when prompting, and eliciting the model's confidence level for outputs it does not recognize as its own. Notably, among the models analyzed, OpenAI's GPT-4 exhibited the strongest confidence-probability alignment, with an average Spearman's $\hat{\rho}$ of 0.42 across a wide range of tasks. Our work contributes to ongoing efforts to facilitate risk assessment in the application of LLMs and to deepen our understanding of model trustworthiness.
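To make the alignment metric concrete, the sketch below (our illustration, not the paper's code) pairs a per-question internal confidence signal (the token probability assigned to the chosen answer) with the confidence the model verbalizes on a structured scale, then correlates the two with Spearman's rank coefficient. The numeric values are hypothetical placeholders; in practice they would come from the model's log-probabilities and from a follow-up "rate your confidence" prompt.

```python
# Minimal sketch of measuring Confidence-Probability Alignment:
# correlate internal (token-probability) confidence with verbalized
# confidence using Spearman's rank correlation coefficient.
from scipy.stats import spearmanr

# Hypothetical per-question values for illustration only.
internal_confidence = [0.91, 0.62, 0.45, 0.88, 0.30, 0.75]  # token probabilities of chosen answers
verbalized_confidence = [0.9, 0.7, 0.4, 0.8, 0.4, 0.8]      # self-reported ratings, rescaled to [0, 1]

# Spearman's rho captures monotonic agreement between the two rankings,
# so the verbalized scale need not be calibrated to raw probabilities.
rho, p_value = spearmanr(internal_confidence, verbalized_confidence)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```

Spearman's $\hat{\rho}$ is a natural choice here because it compares rank orderings, so it rewards a model whose expressed confidence rises and falls with its internal confidence even if the two are on different scales.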