Research on Large Vision-Language Models (LVLMs) has advanced significantly thanks to the success of Large Language Models (LLMs). Nevertheless, these Vision-Language Models (VLMs) suffer from hallucination: owing to an insufficient understanding of the vision and language modalities, VLMs may generate incorrect perceptual information in downstream applications, for example, captioning a non-existent entity. To address hallucination, we first introduce the Contrastive Instruction Evaluation Method (CIEM), an automatic pipeline that leverages an annotated image-text dataset coupled with an LLM to generate factual/contrastive question-answer pairs for evaluating hallucination in VLMs. Second, building on CIEM, we propose a new instruction tuning method, Contrastive Instruction Tuning (CIT), which alleviates hallucination in VLMs by automatically producing high-quality factual/contrastive question-answer pairs and corresponding justifications for model tuning. Through extensive experiments on CIEM and CIT, we pinpoint the hallucination issues commonly present in existing VLMs, the inability of current instruction-tuning datasets to handle hallucination, and the superiority of CIT-tuned VLMs on both CIEM and public datasets.
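To make the idea of factual/contrastive question-answer pairs concrete, the following is a minimal sketch and not the authors' pipeline. CIEM uses an LLM to generate the pairs; here a fixed question template stands in for the LLM, and the names `QAPair`, `build_qa_pairs`, `accuracy`, and the `vocabulary` pool are hypothetical, introduced only for illustration. Factual questions ask about objects annotated in an image (expected answer "yes"), while contrastive questions ask about objects absent from the annotations (expected answer "no").

```python
import random
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str  # "yes" for factual probes, "no" for contrastive probes
    kind: str    # "factual" or "contrastive"


def build_qa_pairs(present, vocabulary, num_contrastive=2, seed=0):
    """Build yes/no probes from the annotated objects of one image.

    present: object names annotated in the image (ground truth).
    vocabulary: hypothetical pool of candidate object names.
    NOTE: a fixed template replaces the LLM used in the paper (assumption).
    """
    rng = random.Random(seed)
    present_set = set(present)

    # Factual probes: the object is annotated, so the answer is "yes".
    pairs = [
        QAPair(f"Is there a {obj} in the image?", "yes", "factual")
        for obj in sorted(present_set)
    ]

    # Contrastive probes: the object is not annotated, so the answer is "no".
    absent = sorted(set(vocabulary) - present_set)
    for obj in rng.sample(absent, min(num_contrastive, len(absent))):
        pairs.append(
            QAPair(f"Is there a {obj} in the image?", "no", "contrastive")
        )
    return pairs


def accuracy(pairs, predictions):
    """Fraction of probes where the model's yes/no answer matches the label."""
    if not pairs:
        return 0.0
    correct = sum(
        pair.answer == pred.strip().lower()
        for pair, pred in zip(pairs, predictions)
    )
    return correct / len(pairs)
```

Under these assumptions, a model that answers "yes" to every probe gets all contrastive questions wrong, so on a set with equal numbers of factual and contrastive probes its accuracy is 0.5. This is the kind of behavior a contrastive evaluation is meant to expose.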