Error prediction in large language models often relies on domain-specific information. In this paper, we present measures that quantify error in a large language model's response based on the diversity of responses to a given prompt, and that are therefore independent of the underlying application. We describe how three such measures, based on entropy, Gini impurity, and centroid distance, can be employed. We perform a suite of experiments across multiple datasets and temperature settings to demonstrate that these measures strongly correlate with the probability of failure. Additionally, we present empirical results demonstrating how these measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.
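To make the three diversity measures concrete, the following is a minimal sketch of how they might be computed over a set of responses sampled for one prompt. The abstract does not give the exact definitions, so everything here is an assumption: responses are grouped by exact string match, entropy is measured in nats, and centroid distance is taken as the mean Euclidean distance of each response embedding from the mean embedding. The function names and the embedding inputs are hypothetical.

```python
import math
from collections import Counter


def entropy(responses):
    """Shannon entropy (nats) of the empirical distribution of distinct responses.

    Assumption: two responses count as the same outcome only if their strings match exactly.
    """
    n = len(responses)
    counts = Counter(responses)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def gini_impurity(responses):
    """Gini impurity: probability that two independently drawn responses differ."""
    n = len(responses)
    counts = Counter(responses)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def mean_centroid_distance(embeddings):
    """Mean Euclidean distance of each embedding from the centroid of all embeddings.

    Assumption: `embeddings` is a non-empty list of equal-length numeric vectors
    produced by some sentence-embedding model (not specified here).
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(v[i] for v in embeddings) / n for i in range(dim)]
    return sum(math.dist(v, centroid) for v in embeddings) / n


# Hypothetical example: five sampled answers to the same prompt.
samples = ["42", "42", "42", "41", "40"]
print(entropy(samples), gini_impurity(samples))
```

Under these assumptions, all three values are zero when every sampled response is identical and grow as the responses spread out, which matches the intuition that higher diversity signals a higher chance of failure.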