Despite massive advancements in large language models (LLMs), they still produce plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct. However, most uncertainty quantification methods have been evaluated on single-labeled questions, which removes data uncertainty: the irreducible randomness often present in user queries, arising from factors such as multiple valid answers. This limitation may make uncertainty quantification results unreliable in practical settings. In this paper, we investigate previous uncertainty quantification methods in the presence of data uncertainty. Our contributions are two-fold: 1) we propose a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks, to evaluate uncertainty quantification under data uncertainty, and 2) we assess five uncertainty quantification methods across diverse white-box and black-box LLMs. Our findings show that previous methods struggle more than in single-answer settings, though the degree varies by task. Moreover, we observe that entropy- and consistency-based methods effectively estimate model uncertainty even in the presence of data uncertainty. We believe these observations will guide future work on uncertainty quantification in more realistic settings.