Visual hallucination (VH) occurs when a multi-modal LLM (MLLM) imagines incorrect details about an image in visual question answering. Existing studies find VH instances only in existing image datasets, and the limited diversity of those instances biases our understanding of how MLLMs perform under VH. In this work, we propose a tool called VHTest to generate a diverse set of VH instances. Specifically, VHTest finds some initial VH instances in existing image datasets (e.g., COCO), generates a text description for each VH mode, and uses a text-to-image generative model (e.g., DALL-E-3) to generate VH images based on the text descriptions. Using VHTest, we collect a benchmark dataset with 1,200 VH instances in 8 VH modes. We find that existing MLLMs such as GPT-4V, LLaVA-1.5, and MiniGPT-v2 hallucinate on a large fraction of the instances in our benchmark. Moreover, we find that fine-tuning an MLLM on our benchmark dataset makes it less likely to hallucinate without sacrificing its performance on other benchmarks. Our benchmark dataset is publicly available: https://github.com/wenhuang2000/VHTest.
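The generation and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the VHTest implementation: `VHInstance`, `generate_vh_instances`, `hallucination_rate`, and the callables `text_to_image`, `answer_fn`, and `is_correct` are hypothetical names introduced here, and the sketch assumes that the per-mode text descriptions, the question, and the reference answer are supplied by the caller. The text-to-image model and the MLLM are passed in as plain functions so that no particular API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class VHInstance:
    """One candidate VH test case (hypothetical structure)."""
    mode: str              # label of the VH mode this instance targets
    image: bytes           # image payload returned by the text-to-image model
    question: str          # question to ask the MLLM about the image
    reference_answer: str  # answer treated as correct for this image


def generate_vh_instances(
    mode: str,
    descriptions: Sequence[str],
    text_to_image: Callable[[str], bytes],
    question: str,
    reference_answer: str,
) -> List[VHInstance]:
    """Turn text descriptions for one VH mode into image-based test cases.

    `text_to_image` stands in for any text-to-image generative model.
    """
    return [
        VHInstance(mode, text_to_image(description), question, reference_answer)
        for description in descriptions
    ]


def hallucination_rate(
    instances: Sequence[VHInstance],
    answer_fn: Callable[[bytes, str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of instances on which the model's answer is judged wrong.

    `answer_fn(image, question)` stands in for an MLLM call and
    `is_correct(answer, reference)` for whatever judging rule is used.
    """
    if not instances:
        raise ValueError("no instances to evaluate")
    wrong = sum(
        1
        for inst in instances
        if not is_correct(answer_fn(inst.image, inst.question), inst.reference_answer)
    )
    return wrong / len(instances)
```

In this sketch, a model that answers every instance incorrectly would get a rate of 1.0, and comparing the rate before and after fine-tuning would be one way to quantify a reduction in hallucination under the chosen judging rule.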