Large vision-language models (LVLMs) have recently pushed the state of the art dramatically in image captioning and many image understanding tasks (e.g., visual question answering). LVLMs, however, often \textit{hallucinate}, producing captions that mention concepts that cannot be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption. Recent work suggests that the addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination. Although intuitive, this claim is not empirically justified: the reduction effects have been established, we argue, with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) that has been extensively used in LVLM training and (ii) measure hallucination via question answering rather than open-ended caption generation. In this work, in contrast, we offer the first systematic analysis of the effect of fine-grained object grounding on LVLM hallucination under an evaluation protocol that more realistically captures LVLM hallucination in open generation. Our extensive experiments over three backbone LLMs reveal that grounding objectives have little to no effect on object hallucination in open caption generation.