Although Vision-Language Models (VLMs) have shown impressive capabilities in tasks such as visual question answering and image captioning, they still struggle with hallucinations. Analysis of the attention distribution in these models shows that VLMs tend to process textual tokens rather than visual tokens. This imbalance in attention causes VLMs to favor textual knowledge when multimodal knowledge conflicts arise, producing outputs that diverge from the image content. In this paper, we propose the Re-Balancing Contrastive Decoding (RBD) method, which employs textual and visual branches to recalibrate the attention distribution in VLMs. Specifically, the textual branch injects image noise to stimulate the model's dependency on text, thereby exposing textual bias so that it can be reduced. Concurrently, the visual branch focuses on selecting significant tokens, refining the attention mechanism to highlight the primary subject. This dual-branch strategy enables RBD to diminish textual bias while enhancing visual information. Experimental results demonstrate that our RBD method outperforms existing methods on the CHAIR and POPE metrics, mitigating hallucinations without reducing the model's general capabilities.
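The dual-branch idea above can be illustrated with a minimal contrastive-decoding sketch. This is not the paper's implementation: the function name, the weights `alpha` and `beta`, and the toy logits are all hypothetical; it only shows how a text-biased branch can be subtracted and a vision-focused branch added when combining next-token logits.

```python
import numpy as np

def rbd_contrastive_logits(logits_base, logits_text_branch, logits_visual_branch,
                           alpha=0.5, beta=0.5):
    """Sketch of a re-balancing contrastive-decoding step.

    logits_base:          next-token logits from the unmodified VLM input.
    logits_text_branch:   logits computed with a noised image, so they
                          lean on the model's textual priors.
    logits_visual_branch: logits computed with only the salient visual
                          tokens kept (hypothetical selection step).
    alpha, beta:          weighting factors (assumed, not from the abstract).
    """
    # Subtracting the text-branch logits penalizes tokens favored purely by
    # textual bias; adding the visual-branch logits rewards tokens grounded
    # in the selected image regions.
    return logits_base - alpha * logits_text_branch + beta * logits_visual_branch

# Toy example over a 4-token vocabulary.
base = np.array([2.0, 1.0, 0.5, 0.1])   # original model favors token 0
text = np.array([2.5, 0.2, 0.1, 0.1])   # text prior also pushes token 0
vis  = np.array([0.5, 2.0, 0.3, 0.1])   # image evidence favors token 1
adj = rbd_contrastive_logits(base, text, vis)
print(int(np.argmax(base)), int(np.argmax(adj)))  # 0 1
```

After re-balancing, the token supported by the visual branch overtakes the token favored by the textual prior, which is the qualitative effect the abstract claims for RBD.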