Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information. Deploying LLMs for medical question answering necessitates reliable uncertainty estimation (UE) methods to detect hallucinations. In this work, we benchmark popular UE methods with different model sizes on medical question-answering datasets. Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications. We also observe that larger models tend to yield better results, suggesting a correlation between model size and the reliability of UE. To address these challenges, we propose Two-phase Verification, a probability-free UE approach. First, an LLM generates a step-by-step explanation alongside its initial answer, then formulates verification questions to check the factual claims in the explanation. The model answers these questions twice: first independently, and then with reference to the explanation. Inconsistencies between the two sets of answers quantify the uncertainty of the original response. We evaluate our approach on three biomedical question-answering datasets using Llama 2 Chat models and compare it against the benchmarked baseline methods. The results show that our Two-phase Verification method achieves the best overall accuracy and stability across various datasets and model sizes, and its performance scales as the model size increases.
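The verification loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt templates, the question-parsing convention (one question per line), and the exact-match consistency check are all assumptions; `llm` stands for any callable that maps a prompt string to a completion string.

```python
def two_phase_verification(llm, question):
    """Estimate the uncertainty of an LLM answer via two-phase verification.

    Returns (initial_response, uncertainty), where uncertainty is the
    fraction of verification questions answered inconsistently.
    All prompt wording below is illustrative, not the paper's exact prompts.
    """
    # Phase 1: initial answer with a step-by-step explanation
    explanation = llm(f"Answer step by step: {question}")

    # Formulate verification questions targeting the explanation's factual claims
    raw = llm("Write verification questions (one per line) for the factual "
              f"claims in this explanation:\n{explanation}")
    ver_questions = [q.strip() for q in raw.splitlines() if q.strip()]
    if not ver_questions:
        return explanation, 0.0

    # Phase 2: answer each verification question twice --
    # first independently, then with the explanation as context
    independent = [llm(q) for q in ver_questions]
    referenced = [llm(f"Explanation: {explanation}\nQuestion: {q}")
                  for q in ver_questions]

    # Uncertainty = fraction of inconsistent answer pairs
    # (exact-match disagreement here; a softer consistency measure could be used)
    mismatches = sum(a.strip() != b.strip()
                     for a, b in zip(independent, referenced))
    return explanation, mismatches / len(ver_questions)
```

With a real model, the two answer sets would be compared with a more tolerant consistency check (e.g. semantic similarity) rather than exact string equality.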