Large-scale vision-language pre-trained (VLP) models are prone to hallucinating non-existent visual objects when generating text from visual information. In this paper, we systematically study the object hallucination problem from three aspects. First, we examine recent state-of-the-art VLP models and show that they still hallucinate frequently, and that models achieving better scores on standard metrics (e.g., CIDEr) can be less faithful. Second, we investigate how different types of image encoding in VLP, including region-based, grid-based, and patch-based encodings, influence hallucination. Surprisingly, we find that patch-based features perform best and that a smaller patch resolution yields a non-trivial reduction in object hallucination. Third, we decouple various VLP objectives and demonstrate that token-level image-text alignment and controlled generation are crucial to reducing hallucination. Based on these findings, we propose a simple yet effective VLP loss named ObjMLM to further mitigate object hallucination. Results show that it reduces object hallucination by up to 17.4% when tested on two benchmarks (COCO Caption for in-domain and NoCaps for out-of-domain evaluation).
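To make the notion of object hallucination concrete, the sketch below computes a simple per-caption rate: the fraction of objects mentioned in a generated caption that are not among the objects annotated for the image. This is only an illustration under stated assumptions, not necessarily the metric used in this work; the function name `hallucination_rate` and its inputs (pre-extracted object lists) are hypothetical.

```python
def hallucination_rate(mentioned_objects, present_objects):
    """Fraction of mentioned objects that are absent from the image.

    Both arguments are assumed to be iterables of normalized object names
    (a hypothetical preprocessing step, e.g. extracted from the caption
    and from the image annotations, respectively).
    """
    mentioned = set(mentioned_objects)
    if not mentioned:
        # A caption that mentions no objects cannot hallucinate any.
        return 0.0
    hallucinated = mentioned - set(present_objects)
    return len(hallucinated) / len(mentioned)


# The caption mentions three objects; "car" is not annotated in the image.
rate = hallucination_rate(["dog", "frisbee", "car"], ["dog", "frisbee", "grass"])
print(f"hallucination rate: {rate:.2f}")
```

A lower value means the caption is more faithful to the image content. A fluent caption can still score poorly here, which is consistent with the observation that standard captioning metrics and faithfulness can diverge.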