Estimating the uncertainty or confidence of a model's responses is important for evaluating trust not only in the responses themselves, but also in the model as a whole. In this paper, we study the problem of estimating confidence for responses of large language models (LLMs) given only black-box (query) access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective at estimating the confidence of Flan-ul2, Llama-13b, and Mistral-7b on four benchmark Q\&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over $10\%$ (in AUROC) in some cases. Moreover, our interpretable approach provides insight into which features are predictive of confidence, leading to the interesting and useful discovery that confidence models built for one LLM generalize zero-shot to other LLMs on a given dataset.
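The core recipe above (train an interpretable logistic-regression model on engineered features to predict response correctness) can be sketched as follows. This is a minimal illustration only: the features and labels here are synthetic placeholders, not the paper's actual engineered features, and the plain gradient-descent fitting routine is an assumption standing in for any standard logistic-regression implementation.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression; returns weights and bias."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(response correct)
        grad = p - y                            # dLoss/dlogit for log-loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
# Hypothetical black-box features per LLM response (e.g. agreement across
# sampled answers or prompt perturbations); labels mark response correctness.
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(float)  # synthetic labels for illustration

w, b = fit_logreg(X, y)
conf = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # estimated confidence in [0, 1]
print(conf.shape)
```

Because the model is a linear logistic regression, the learned weights `w` can be inspected directly to see which features are most predictive of confidence, which is what enables the interpretability analysis described in the abstract.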