Although Vision-Language Models (VLMs) have shown impressive capabilities in tasks such as visual question answering and image captioning, they still struggle with hallucinations. Analysis of the attention distribution in these models shows that VLMs tend to process textual tokens rather than visual tokens. This imbalance in attention causes VLMs to favor textual knowledge when multimodal knowledge conflicts arise, producing outputs that diverge from the image content. In this paper, we propose the Re-Balancing Contrastive Decoding (RBD) method, which employs textual and visual branches to recalibrate the attention distribution in VLMs. Specifically, the textual branch injects image noise to stimulate the model's dependency on text, thereby exposing textual bias so that it can be reduced. Concurrently, the visual branch focuses on selecting significant tokens, refining the attention mechanism to highlight the primary subject. This dual-branch strategy enables RBD to diminish textual bias while enhancing visual information. Experimental results demonstrate that our RBD method outperforms existing methods on the CHAIR and POPE metrics, mitigating hallucinations without reducing the model's general capabilities.
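The dual-branch idea above can be illustrated with a minimal contrastive-decoding sketch. This is not the paper's implementation: the function name, the weights `alpha` and `beta`, and the toy logits are all hypothetical; it only shows how a text-biased branch can be subtracted and a vision-focused branch added when combining next-token logits.

```python
import numpy as np

def rbd_contrastive_logits(logits_base, logits_text_branch, logits_visual_branch,
                           alpha=0.5, beta=0.5):
    """Sketch of a re-balancing contrastive-decoding step.

    logits_base:          next-token logits from the unmodified VLM input.
    logits_text_branch:   logits computed with a noised image, so they
                          lean on the model's textual priors.
    logits_visual_branch: logits computed with only the salient visual
                          tokens kept (hypothetical selection step).
    alpha, beta:          weighting factors (assumed, not from the abstract).
    """
    # Subtracting the text-branch logits penalizes tokens favored purely by
    # textual bias; adding the visual-branch logits rewards tokens grounded
    # in the selected image regions.
    return logits_base - alpha * logits_text_branch + beta * logits_visual_branch

# Toy example over a 4-token vocabulary.
base = np.array([2.0, 1.0, 0.5, 0.1])   # original model favors token 0
text = np.array([2.5, 0.2, 0.1, 0.1])   # text prior also pushes token 0
vis  = np.array([0.5, 2.0, 0.3, 0.1])   # image evidence favors token 1
adj = rbd_contrastive_logits(base, text, vis)
print(int(np.argmax(base)), int(np.argmax(adj)))  # 0 1
```

After re-balancing, the token supported by the visual branch overtakes the token favored by the textual prior, which is the qualitative effect the abstract claims for RBD.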