Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies between the knowledge represented in their vision and language components. In this paper, we formally define the problem of $\textbf{cross-modality parametric knowledge conflict}$ and present a systematic approach to detect, interpret, and mitigate such conflicts. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to discern conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality components based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves average accuracy by 2.24%.
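The dynamic contrastive decoding idea sketched in the abstract can be illustrated with a minimal toy example: logits from the full (joint) model are contrasted against logits attributed to the less confident modality, with the contrast strength set dynamically from the confidence gap. The function names and the linear weighting scheme below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(logits):
    """Convert a list of logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_contrastive_decoding(logits_joint, logits_weak, conf_strong, conf_weak):
    """Penalize logits from the less confident modality branch.

    The penalty weight alpha grows with the confidence gap between the
    two modalities (an assumed scheme; the paper's formula may differ).
    """
    alpha = max(conf_strong - conf_weak, 0.0)
    adjusted = [(1 + alpha) * j - alpha * w
                for j, w in zip(logits_joint, logits_weak)]
    return softmax(adjusted)
```

For example, if the weak modality strongly favors a distractor token, subtracting its scaled logits can flip the prediction back to the answer supported by the more confident modality.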