MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty

Despite major advances in large language models (LLMs), they still produce plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct. However, most uncertainty quantification methods have been evaluated on single-labeled questions, which removes data uncertainty: the irreducible randomness often present in user queries, which can arise from factors such as multiple possible answers. This limitation may make uncertainty quantification results unreliable in practical settings. In this paper, we investigate previous uncertainty quantification methods in the presence of data uncertainty. Our contributions are two-fold: 1) we propose a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks, to evaluate uncertainty quantification regarding data uncertainty, and 2) we assess 5 uncertainty quantification methods on diverse white- and black-box LLMs. Our findings show that previous methods struggle more than in single-answer settings, though the gap varies by task. Moreover, we observe that entropy- and consistency-based methods effectively estimate model uncertainty, even in the presence of data uncertainty. We believe these observations will guide future work on uncertainty quantification in more realistic settings.
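To make the two method families named in the abstract concrete, here is a minimal sketch of an entropy-based and a consistency-based score computed over several answers sampled for one question. It is an illustration under stated assumptions, not the paper's implementation: the function names `answer_entropy` and `consistency`, the exact-match normalization (strip and lowercase), and the toy sample lists are all hypothetical.

```python
import math
from collections import Counter


def _normalize(answer: str) -> str:
    # Assumption: answers are compared by exact match after trimming and lowercasing.
    return answer.strip().lower()


def answer_entropy(samples: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over sampled answers.

    Assumes `samples` is non-empty. Higher values mean the samples disagree more.
    """
    counts = Counter(_normalize(s) for s in samples)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def consistency(samples: list[str]) -> float:
    """Fraction of samples that agree with the most frequent answer.

    Assumes `samples` is non-empty. Higher values mean the samples agree more.
    """
    counts = Counter(_normalize(s) for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


# Toy data (hypothetical): a single-answer question vs. a multi-answer question.
single_answer_samples = ["Paris", "Paris", "paris", "Paris"]
multi_answer_samples = ["red", "green", "blue", "red"]

print(answer_entropy(single_answer_samples), consistency(single_answer_samples))
print(answer_entropy(multi_answer_samples), consistency(multi_answer_samples))
```

In the toy multi-answer case the samples spread over several answers, so the entropy rises and the consistency falls even if every sampled answer were correct. That spread is the kind of data uncertainty a single-answer benchmark never exercises.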

