The advancement of Large Vision-Language Models (LVLMs) has drawn increasing attention to a critical problem: their tendency to hallucinate objects that are not present in the image. To address this problem, previous work has relied on specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training or fine-tuning, or API access to advanced LLMs that correct the model's output after generation. In this paper, we tackle this challenge by introducing Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE), a framework that is both training-free and API-free and that effectively and efficiently reduces object hallucinations during the generation process. Specifically, MARINE enriches the visual context of LVLMs by integrating existing open-source vision models, and employs classifier-free guidance to incorporate the additional object grounding features, improving the precision of LVLMs' generations. Through comprehensive evaluations across six popular LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it not only reduces hallucinations but also improves the level of detail in LVLMs' generations, as assessed by GPT-4V.
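To make the guidance step concrete, the following is a minimal sketch of classifier-free guidance applied to next-token distributions. It assumes one common formulation, in which the guided log-probabilities are a weighted combination of a conditional pass (with the extra object grounding context) and an unconditional pass (without it); the abstract does not state the exact formula. The guidance weight `gamma`, the function names, and the toy vocabulary and logits are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_log_probs(cond_logits, uncond_logits, gamma):
    """Combine two next-token distributions with classifier-free guidance.

    Assumed form: gamma * log p(y | image, grounding, prompt)
                  + (1 - gamma) * log p(y | image, prompt),
    renormalised so the result is again a valid distribution.
    gamma = 0 recovers the unguided model; gamma = 1 uses only the
    grounded pass.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [gamma * c + (1.0 - gamma) * u for c, u in zip(cond, uncond)]
    return log_softmax(mixed)


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy three-token vocabulary (hypothetical values).
vocab = ["dog", "frisbee", "cat"]
# Without grounding, the model leans toward "cat", which is not in the image.
uncond_logits = [1.0, 0.5, 2.0]
# With detected objects added to the context, "dog" becomes most likely.
cond_logits = [2.5, 0.5, 0.2]

for gamma in (0.0, 0.7, 1.0):
    lp = guided_log_probs(cond_logits, uncond_logits, gamma)
    print(gamma, vocab[argmax(lp)])
```

In this toy setting, raising `gamma` shifts the greedy choice away from the ungrounded token and toward the token supported by the grounding context. In a real LVLM the two logit vectors would come from two forward passes at every decoding step.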