Although large language models (LLMs) are capable of performing various tasks, they remain prone to producing plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct. However, most uncertainty quantification methods have been evaluated on questions requiring a single clear answer, ignoring data uncertainty, which arises from irreducible randomness. Instead, these methods consider only model uncertainty, which arises from a lack of knowledge. In this paper, we investigate previous uncertainty quantification methods in the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks, to evaluate uncertainty quantification under data uncertainty, and 2) assessing five uncertainty quantification methods on diverse white- and black-box LLMs. Our findings show that entropy- and consistency-based methods estimate model uncertainty well even under data uncertainty, while other methods for white- and black-box LLMs struggle depending on the task. Additionally, methods designed for white-box LLMs suffer from overconfidence on reasoning tasks compared to simple knowledge queries. We believe our observations will pave the way for future work on uncertainty quantification in realistic settings.
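To make the entropy- and consistency-based family of methods concrete, the following is a minimal illustrative sketch, not the paper's implementation: it estimates predictive entropy from repeated samples of an LLM's answer to the same question, treating the empirical answer frequencies as a categorical distribution (the function name `predictive_entropy` and the example answers are our own):

```python
import math
from collections import Counter

def predictive_entropy(sampled_answers):
    """Estimate predictive entropy from repeated sampled answers.

    Identical answers are grouped, their empirical frequencies form a
    categorical distribution, and the Shannon entropy of that distribution
    is returned. Higher entropy suggests higher uncertainty; under data
    uncertainty (multiple valid answers), entropy can be high even when
    the model "knows" the answer set, which is the evaluation challenge
    the abstract describes.
    """
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A model that consistently returns the same answer: entropy 0.
print(predictive_entropy(["Paris"] * 5))

# Answers spread over several options yield higher entropy.
print(predictive_entropy(["Paris", "Lyon", "Paris", "Nice", "Paris"]))
```

Consistency-based methods follow a similar intuition, scoring agreement among sampled answers (often after semantic clustering of paraphrases) rather than computing entropy directly.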