Aligning Large Multi-Modal Model with Robust Instruction Tuning

Despite promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without requiring human-annotated ground-truth answers and adapts to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly with Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets, using less training data than state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Our project is available at https://fuxiaoliu.github.io/LRV/.
