Mitigating hallucinations in Large Multi-modal Models (LMMs) is crucial to enhancing their reliability as general-purpose assistants. This paper shows that such hallucinations can be significantly exacerbated by preceding user-system dialogues. To measure this precisely, we first present an evaluation benchmark that extends popular multi-modal benchmark datasets with prepended hallucinatory dialogues generated by our novel Adversarial Question Generator, which automatically produces image-related yet adversarial dialogues by applying adversarial attacks to LMMs. On our benchmark, the zero-shot performance of state-of-the-art LMMs drops significantly on both the VQA and captioning tasks. We further reveal that this hallucination stems mainly from a prediction bias toward the preceding dialogue rather than the visual content. To reduce this bias, we propose Adversarial Instruction Tuning, which robustly fine-tunes LMMs on multi-modal instruction-following datasets augmented with hallucinatory dialogues. Extensive experiments show that the proposed approach successfully reduces dialogue hallucination while maintaining or even improving performance.