The recent success of large language and vision models on visual question answering (VQA), particularly in its medical applications (Med-VQA), has shown great potential for realizing effective visual assistants for healthcare. However, these models have not been extensively tested for hallucination in clinical settings. Here, we create a hallucination benchmark of medical images paired with question-answer sets and conduct a comprehensive evaluation of state-of-the-art models. The study provides an in-depth analysis of current models' limitations and reveals the effectiveness of various prompting strategies.