Large language models (LLMs) have shown promise on generative and knowledge-intensive tasks, including question answering (QA). However, practical deployment still faces challenges, notably "hallucination", where a model generates plausible-sounding but unfaithful or nonsensical information. The issue is particularly critical in the medical domain because of its uncommon professional concepts and the potential social risks involved. This paper analyses hallucination in medical generative QA systems using widely adopted LLMs and datasets. Our investigation centers on identifying and understanding common problematic answers, with a specific emphasis on hallucination. To tackle this challenge, we present an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation. Through this feedback process, our approach steadily improves the factuality, consistency, and entailment of the generated answers. It thereby harnesses the interactivity and multitasking ability of LLMs to produce progressively more precise and accurate answers. Results from both automatic and human evaluation show that our approach reduces hallucination more effectively than the baselines.
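The feedback process described above, in which knowledge is acquired and then used to generate an answer, with each stage revised until it is good enough, might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify prompts, scoring functions, thresholds, or round limits, so `refine_until`, `reflective_answer`, `score_knowledge`, `score_answer`, `threshold`, and `max_rounds` are all hypothetical names and values, and `llm` stands in for any text-in, text-out model call.

```python
from typing import Callable, Tuple

LLM = Callable[[str], str]  # hypothetical: any prompt -> completion function


def refine_until(
    draft: str,
    score: Callable[[str], float],
    refine: Callable[[str, float], str],
    threshold: float,
    max_rounds: int,
) -> Tuple[str, float]:
    """Revise a draft until its score reaches the threshold or rounds run out."""
    current, current_score = draft, score(draft)
    for _ in range(max_rounds):
        if current_score >= threshold:
            break
        # Feed the low score back so the next revision can react to it.
        current = refine(current, current_score)
        current_score = score(current)
    return current, current_score


def reflective_answer(
    question: str,
    llm: LLM,
    score_knowledge: Callable[[str], float],    # assumed factuality scorer
    score_answer: Callable[[str, str], float],  # assumed consistency scorer
    threshold: float = 0.8,                     # assumed cutoff
    max_rounds: int = 3,                        # assumed round limit
) -> str:
    # Stage 1: acquire background knowledge and refine it.
    knowledge = llm(f"List background facts relevant to: {question}")
    knowledge, _ = refine_until(
        knowledge,
        score_knowledge,
        lambda k, s: llm(f"Score {s:.2f} is too low. Revise these facts for '{question}': {k}"),
        threshold,
        max_rounds,
    )
    # Stage 2: answer using the refined knowledge, then refine the answer.
    answer = llm(f"Using these facts: {knowledge}\nAnswer the question: {question}")
    answer, _ = refine_until(
        answer,
        lambda a: score_answer(a, knowledge),
        lambda a, s: llm(f"Score {s:.2f} is too low. Revise this answer to '{question}' "
                         f"so it agrees with: {knowledge}\nAnswer: {a}"),
        threshold,
        max_rounds,
    )
    return answer
```

Keeping the loop in one helper means both stages share the same stopping rule; in practice the two scorers would likely be separate models or metrics, which the sketch leaves to the caller.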