Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Despite promising progress on multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses the issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies, which primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation, and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach that evaluates visual instruction tuning the way human experts would. GAVIE does not require human-annotated ground-truth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model.

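The observation that a balanced ratio of positive and negative instances leads to a more robust model can be illustrated with a minimal sketch. The abstract does not say how the ratio is set, so everything below is an assumption made for illustration: the `Instruction` record and its field names are hypothetical and not the LRV-Instruction schema, the `balance` helper is not from the paper, and random downsampling is only one simple way to reach a target ratio.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """Hypothetical record; field names are illustrative, not the dataset schema."""
    image_id: str
    question: str
    answer: str
    is_negative: bool  # True if the instruction asks about manipulated content


def balance(samples, neg_fraction=0.5, seed=0):
    """Downsample one class so negatives make up about `neg_fraction` of the result."""
    if not 0.0 < neg_fraction < 1.0:
        raise ValueError("neg_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    pos = [s for s in samples if not s.is_negative]
    neg = [s for s in samples if s.is_negative]
    # Largest total size for which both classes can still supply their share.
    total = int(min(len(pos) / (1.0 - neg_fraction), len(neg) / neg_fraction))
    n_neg = min(len(neg), round(total * neg_fraction))
    n_pos = min(len(pos), total - n_neg)
    mixed = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(mixed)
    return mixed


# Toy data: six positive instructions and two negative ones (invented examples).
data = [Instruction(f"img{i}", "What is on the table?", "A cup.", False)
        for i in range(6)]
data += [Instruction(f"img{i}", "What color is the unicorn?",
                     "There is no unicorn in the image.", True)
         for i in range(2)]

mixed = balance(data, neg_fraction=0.5)
print(len(mixed), sum(s.is_negative for s in mixed))
```

Downsampling the larger class discards data. Upsampling the smaller class or weighting the loss per sample are alternative ways to reach the same ratio, and the abstract does not indicate which, if any, the authors use.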