Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in MLLMs, remain under-explored. This work investigates several types of hallucination (i.e., object, relation, and attribute hallucinations) and mitigates this hallucinatory toxicity in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly, HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on this finding, we perform counterfactual visual instruction expansion to balance the data distribution, thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method achieves a 44.6% relative reduction in hallucinations while maintaining performance competitive with LLaVA. The data and code for this paper are publicly available at \url{https://github.com/Yuqifan1117/HalluciDoctor}.