Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation: hallucination, the tendency to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain hallucinations and texts that do not are entangled, making them challenging to distinguish. These two observations motivate a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use hallucinated text as hard negative examples, which naturally pulls the representations of non-hallucinated text and visual samples closer together while pushing apart the representations of non-hallucinated and hallucinated text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains improvements of 34.66% and 29.5% over the baselines MiniGPT-4 and LLaVA, respectively.
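
The abstract does not give the exact objective, so the following is only a minimal sketch of the general idea, assuming an InfoNCE-style loss with an image representation as the anchor, a faithful caption as the positive, and a hallucinated caption added as a hard negative. The function name `contrastive_loss`, the temperature value, and the two-dimensional toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(image, positive_text, negative_texts, temperature=0.1):
    """InfoNCE-style loss for one image anchor (assumed form, not the paper's).

    The loss is low when the image is more similar to the positive text
    than to every negative text.
    """
    anchor = normalize(image)
    candidates = [positive_text] + list(negative_texts)
    logits = [dot(anchor, normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    # Negative log-probability assigned to the positive (index 0).
    return log_denominator - logits[0]


# Hypothetical toy representations.
image = [1.0, 0.0]
faithful = [0.9, 0.1]       # caption consistent with the image
hallucinated = [0.8, 0.6]   # caption close to the image but partly fabricated
unrelated = [-1.0, 0.0]     # easy negative, e.g. a caption of another image

loss_easy_only = contrastive_loss(image, faithful, [unrelated])
loss_with_hard = contrastive_loss(image, faithful, [unrelated, hallucinated])
print(loss_easy_only, loss_with_hard)
```

In this toy setting the hallucinated caption lies close to the image, so including it as a negative raises the loss far more than the unrelated caption does. Minimizing such a loss would therefore put pressure on the encoder to separate hallucinated text from both the image and the faithful text, which is the effect the abstract describes.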
