Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still suffer from hallucination, a fundamental limitation in which they generate erroneous or fabricated information. In this paper, we address hallucination in MLLMs from the novel perspective of representation learning. We first analyze the representation distributions of textual and visual tokens in MLLMs and report two findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that do and do not contain hallucinations are entangled, making them hard to distinguish. These two observations motivate a simple yet effective method for mitigating hallucinations. Specifically, we introduce contrastive learning into MLLMs and use hallucinated text as hard negative examples, which pulls the representations of non-hallucinated text and visual samples closer together while pushing the representations of non-hallucinated and hallucinated text apart. We evaluate our method quantitatively and qualitatively, showing that it reduces hallucinations and improves performance across multiple benchmarks. On the MMhal-Bench benchmark, our method improves over the MiniGPT-4 and LLaVA baselines by 34.66% and 29.5%, respectively.
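To make the idea of contrastive learning with hallucinated hard negatives concrete, the following is a minimal sketch of one possible loss, not the exact objective used in this work. It assumes an InfoNCE-style image-to-text loss in which each image's paired faithful caption is the positive, the other captions in the batch are ordinary negatives, and a hallucinated caption for the same image is appended as an extra hard negative. The function names, the temperature value, and the use of pooled per-sample embeddings are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contrastive_loss_with_hard_negatives(image_emb, text_emb, hallucinated_emb,
                                         temperature=0.1):
    """InfoNCE-style image-to-text loss with one hard negative per image.

    All inputs are hypothetical pooled embeddings of shape (B, D):
      image_emb[i]         visual representation of sample i
      text_emb[i]          representation of the faithful caption of sample i
      hallucinated_emb[i]  representation of a hallucinated caption of sample i
    """
    v = l2_normalize(image_emb)
    t = l2_normalize(text_emb)
    h = l2_normalize(hallucinated_emb)

    # (B, B): similarity of every image to every faithful caption in the batch.
    sim_vt = (v @ t.T) / temperature
    # (B, 1): similarity of each image to its own hallucinated caption.
    sim_vh = np.sum(v * h, axis=1, keepdims=True) / temperature

    # Column i is the positive for row i; the last column is the hard negative.
    logits = np.concatenate([sim_vt, sim_vh], axis=1)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    idx = np.arange(v.shape[0])
    return float(-log_prob[idx, idx].mean())
```

Minimizing this loss raises the similarity between an image and its faithful caption and lowers its similarity to the hallucinated caption, which matches the pull-together and push-apart behavior described above. In a real MLLM the embeddings would come from the model's visual and textual token representations and the loss would be computed in a framework with automatic differentiation; NumPy is used here only to keep the sketch self-contained.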