Large language models (LLMs) have shown remarkable performance on natural language processing (NLP) tasks. To comprehend and execute diverse human instructions over image data, instruction-tuned large vision-language models (LVLMs) have been introduced. However, LVLMs may suffer from different types of object hallucination, and current evaluations cover only the coarse-grained type (i.e., generated objects that do not exist in the input image). Fine-grained hallucinations, in which the model generates object attributes and behaviors that do not exist in the image, can still occur but are not measured by current evaluation methods. In this paper, we therefore focus on reducing fine-grained hallucinations of LVLMs. We propose \textit{ReCaption}, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions. We also propose a fine-grained probing-based evaluation method named \textit{Fine-Grained Object Hallucination Evaluation} (\textit{FGHE}). Our experimental results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLMs and improves their text generation quality. The code can be found at https://github.com/Anonymousanoy/FOHE.