Hallucination has been a major problem for large language models, and it remains a critical challenge in the multimodal setting, where vision-language models (VLMs) must handle not only textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination remain limited and are mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination, such as false premises, insufficient contexts, and visual challenges. A key idea behind HaloQuest is to leverage synthetic images, in addition to real ones, to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest is designed to serve both as a challenging benchmark for VLMs and as a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. Fine-tuning on HaloQuest, on the other hand, significantly reduces hallucination rates while preserving performance on standard reasoning tasks. We further find that benchmark results on generated images correlate highly (r=0.97) with those on real images. Finally, we propose a novel Auto-Eval mechanism for evaluating VLMs that is highly correlated with human raters (r=0.99). In sum, this work makes concrete strides toward understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step toward more reliable multimodal AI systems.