Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation: hallucination, where they generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making them difficult to distinguish. These two observations motivate a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use hallucinated text as hard negative examples, bringing representations of non-hallucinated text and visual samples closer while pushing apart representations of non-hallucinated and hallucinated text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains improvements of 34.66% and 29.5% over the baselines MiniGPT-4 and LLaVA, respectively. Our code is available at https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
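The contrastive objective described above can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the linked repository); it assumes pooled, fixed-size embeddings for each image, its correct caption, and one hallucinated caption, and uses a standard InfoNCE-style image-to-text loss in which the hallucinated captions are appended as extra negatives. The function names, the temperature value of 0.07, and the one-hallucinated-caption-per-image setup are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def contrastive_loss_with_hard_negatives(image_emb, text_emb, halluc_emb,
                                         temperature=0.07):
    """Image-to-text InfoNCE loss with hallucinated captions as hard negatives.

    image_emb, text_emb, halluc_emb: arrays of shape (B, D). Row i of
    text_emb is the correct caption for image i; row i of halluc_emb is a
    hallucinated caption for image i (hypothetical setup, one per image).
    """
    v = l2_normalize(image_emb)
    t = l2_normalize(text_emb)
    h = l2_normalize(halluc_emb)

    # Candidate pool: B correct captions followed by B hallucinated captions.
    candidates = np.concatenate([t, h], axis=0)        # (2B, D)
    logits = v @ candidates.T / temperature            # (B, 2B)

    # Numerically stable log-softmax over the candidate pool.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for image i is candidate i (its own correct caption).
    idx = np.arange(v.shape[0])
    return float(-log_prob[idx, idx].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))
    captions = images + 0.01 * rng.normal(size=(4, 8))  # well aligned
    hallucinated = rng.normal(size=(4, 8))               # unrelated text
    print(contrastive_loss_with_hard_negatives(images, captions, hallucinated))
```

Minimizing this loss raises the similarity between an image and its correct caption relative to every other candidate, including the hallucinated ones, which is the "pull closer, push apart" behavior the abstract describes. A symmetric text-to-image term could be added in the same way.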