Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information. Deploying LLMs for medical question answering necessitates reliable uncertainty estimation (UE) methods to detect hallucinations. In this work, we benchmark popular UE methods with different model sizes on medical question-answering datasets. Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications. We also observe that larger models tend to yield better results, suggesting a correlation between model size and the reliability of UE. To address these challenges, we propose Two-phase Verification, a probability-free UE approach. First, an LLM generates a step-by-step explanation alongside its initial answer, then formulates verification questions to check the factual claims in the explanation. The model answers these questions twice: first independently, and then with reference to the explanation. Inconsistencies between the two sets of answers quantify the uncertainty of the original response. We evaluate our approach on three biomedical question-answering datasets using Llama 2 Chat models and compare it against the benchmarked baseline methods. The results show that our Two-phase Verification method achieves the best overall accuracy and stability across various datasets and model sizes, and its performance scales as the model size increases.
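The verification loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt templates, the question-parsing convention (one question per line), and the exact-match consistency check are all assumptions; `llm` stands for any callable that maps a prompt string to a completion string.

```python
def two_phase_verification(llm, question):
    """Estimate the uncertainty of an LLM answer via two-phase verification.

    Returns (initial_response, uncertainty), where uncertainty is the
    fraction of verification questions answered inconsistently.
    All prompt wording below is illustrative, not the paper's exact prompts.
    """
    # Phase 1: initial answer with a step-by-step explanation
    explanation = llm(f"Answer step by step: {question}")

    # Formulate verification questions targeting the explanation's factual claims
    raw = llm("Write verification questions (one per line) for the factual "
              f"claims in this explanation:\n{explanation}")
    ver_questions = [q.strip() for q in raw.splitlines() if q.strip()]
    if not ver_questions:
        return explanation, 0.0

    # Phase 2: answer each verification question twice --
    # first independently, then with the explanation as context
    independent = [llm(q) for q in ver_questions]
    referenced = [llm(f"Explanation: {explanation}\nQuestion: {q}")
                  for q in ver_questions]

    # Uncertainty = fraction of inconsistent answer pairs
    # (exact-match disagreement here; a softer consistency measure could be used)
    mismatches = sum(a.strip() != b.strip()
                     for a, b in zip(independent, referenced))
    return explanation, mismatches / len(ver_questions)
```

With a real model, the two answer sets would be compared with a more tolerant consistency check (e.g. semantic similarity) rather than exact string equality.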