Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Despite promising progress on multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies, which focus primarily on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without requiring human-annotated ground-truth answers and that adapts to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets, using less training data than state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Updates of our project are available at https://fuxiaoliu.github.io/LRV/.
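The abstract does not say how the balanced ratio of positive and negative instances is obtained, so the following is only a minimal sketch of one way such a training mix could be assembled. The `Instruction` record, its `kind` labels, and the `balanced_mix` helper are hypothetical names introduced for illustration; they are not part of the LRV-Instruction release.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """Hypothetical record for one visual instruction sample."""
    image_id: str
    prompt: str
    answer: str
    # Assumed labels: "positive", "nonexistent" (Nonexistent Element
    # Manipulation), or "existent" (Existent Element Manipulation).
    kind: str


def balanced_mix(samples, negative_fraction=0.5, seed=0):
    """Subsample so negatives make up roughly `negative_fraction` of the mix.

    This is an illustrative assumption, not the authors' procedure.
    """
    if not 0.0 < negative_fraction < 1.0:
        raise ValueError("negative_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)
    positives = [s for s in samples if s.kind == "positive"]
    negatives = [s for s in samples if s.kind != "positive"]

    # Largest total size that both pools can support at the requested ratio.
    total = min(
        len(negatives) / negative_fraction,
        len(positives) / (1.0 - negative_fraction),
    )
    # Clamp to the pool sizes to guard against floating-point rounding.
    n_neg = min(int(total * negative_fraction), len(negatives))
    n_pos = min(int(total * (1.0 - negative_fraction)), len(positives))

    mix = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(mix)  # interleave positives and negatives
    return mix
```

Calling `balanced_mix(samples, negative_fraction=0.5)` returns a shuffled list with equal numbers of positive and negative samples, limited by whichever pool is smaller. Subsampling the larger pool is one design choice; oversampling the smaller pool would be an alternative.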
