Research on Large Vision-Language Models (LVLMs) has advanced significantly thanks to the success of Large Language Models (LLMs). Nevertheless, these Vision-Language Models (VLMs) suffer from hallucination: because they insufficiently understand the vision and language modalities, VLMs may generate incorrect perceptual information in downstream applications, for example, captioning a non-existent entity. To address hallucination, we first introduce the Contrastive Instruction Evaluation Method (CIEM), an automatic pipeline that leverages an annotated image-text dataset coupled with an LLM to generate factual/contrastive question-answer pairs for evaluating hallucination in VLMs. Second, building on CIEM, we propose a new instruction tuning method, Contrastive Instruction Tuning (CIT), which alleviates hallucination in VLMs by automatically producing high-quality factual/contrastive question-answer pairs and corresponding justifications for model tuning. Through extensive experiments on CIEM and CIT, we pinpoint the hallucination issues commonly present in existing VLMs, the inability of current instruction-tuning datasets to handle hallucination, and the superiority of CIT-tuned VLMs on both CIEM and public datasets.
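To make the factual/contrastive evaluation idea concrete, the following is a minimal sketch, not the CIEM pipeline itself. It assumes each image is annotated with a set of object names and uses a fixed question template as a stand-in for the LLM that the abstract describes. The names `QAPair`, `build_qa_pairs`, `evaluate`, and the `answer_fn` callback (a placeholder for a VLM) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class QAPair:
    image_id: str
    question: str
    answer: str  # "yes" for factual pairs, "no" for contrastive pairs
    kind: str    # "factual" or "contrastive"


def build_qa_pairs(annotations: Dict[str, Set[str]],
                   vocabulary: Set[str]) -> List[QAPair]:
    """Build yes/no questions from object annotations.

    Assumption: a fixed template replaces the LLM used in the paper.
    Factual questions ask about annotated objects; contrastive questions
    ask about vocabulary objects that are absent from the image.
    """
    pairs: List[QAPair] = []
    for image_id in sorted(annotations):
        present = annotations[image_id]
        for obj in sorted(present):
            pairs.append(QAPair(image_id, f"Is there a {obj} in the image?",
                                "yes", "factual"))
        for obj in sorted(vocabulary - present):
            pairs.append(QAPair(image_id, f"Is there a {obj} in the image?",
                                "no", "contrastive"))
    return pairs


def evaluate(pairs: List[QAPair],
             answer_fn: Callable[[str, str], str]) -> Dict[str, float]:
    """Return accuracy per question kind for a model given as answer_fn."""
    correct = {"factual": 0, "contrastive": 0}
    total = {"factual": 0, "contrastive": 0}
    for pair in pairs:
        total[pair.kind] += 1
        prediction = answer_fn(pair.image_id, pair.question).strip().lower()
        if prediction == pair.answer:
            correct[pair.kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total if total[kind]}


if __name__ == "__main__":
    # Toy annotations, invented purely for illustration.
    toy_annotations = {"img1": {"dog", "ball"}, "img2": {"cat"}}
    toy_vocabulary = {"dog", "ball", "cat"}
    toy_pairs = build_qa_pairs(toy_annotations, toy_vocabulary)

    # A stand-in "model" that always answers yes, i.e. hallucinates
    # every absent object.
    print(evaluate(toy_pairs, lambda image_id, question: "yes"))
```

Under these assumptions, a model that always answers "yes" scores perfectly on the factual questions and fails every contrastive one, which is the kind of gap a contrastive evaluation is meant to expose.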