Vision-language models (VLMs) increasingly combine visual and textual information to perform complex tasks. However, conflicts between their internal knowledge and external visual input can lead to hallucinations and unreliable predictions. In this work, we investigate the mechanisms that VLMs use to resolve such cross-modal conflicts by introducing WHOOPS-AHA!, a dataset of multimodal counterfactual queries that deliberately contradict the model's internal commonsense knowledge. Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening on these heads, we can steer the model towards either its internal parametric knowledge or the visual information. Our results also show that the attention patterns of these heads localize the image regions that drive visual overrides, yielding more precise attribution than gradient-based methods.
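The head-identification and intervention steps described above can be illustrated with a toy sketch. It is not the paper's implementation: it assumes a simplified form of logit inspection in which each head's output at the final token position is projected directly onto the unembedding difference between the counterfactual (image-supported) token and the factual (commonsense) token, ignoring layer normalization. All names (`head_logit_diffs`, `top_heads`, `steered_logit_diff`) and all numbers are hypothetical.

```python
# Toy sketch under stated assumptions: names, shapes, and values are illustrative
# and do not come from the paper. Layer normalization is ignored.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def head_logit_diffs(head_outputs, unembed, factual_id, counterfactual_id):
    """Per-head contribution to logit(counterfactual) - logit(factual).

    head_outputs: dict mapping (layer, head) -> vector the head adds to the
                  residual stream at the final token position.
    unembed:      list of unembedding row vectors, one per token id.
    """
    direction = [c - f for c, f in zip(unembed[counterfactual_id], unembed[factual_id])]
    return {key: dot(vec, direction) for key, vec in head_outputs.items()}

def top_heads(diffs, k):
    """Return the k heads with the largest absolute logit-difference contribution."""
    return sorted(diffs, key=lambda key: abs(diffs[key]), reverse=True)[:k]

def steered_logit_diff(head_outputs, unembed, factual_id, counterfactual_id, targets, scale):
    """Total head-attributed logit difference after rescaling the target heads.

    scale < 1 weakens the target heads, scale > 1 amplifies them.
    """
    diffs = head_logit_diffs(head_outputs, unembed, factual_id, counterfactual_id)
    return sum(scale * d if key in targets else d for key, d in diffs.items())

# Hypothetical two-token vocabulary: id 0 = factual answer, id 1 = counterfactual answer.
unembed = [[1.0, 0.0], [0.0, 1.0]]
head_outputs = {
    (0, 0): [0.1, 0.2],
    (1, 3): [-0.5, 1.5],   # pushes strongly towards the counterfactual token
    (2, 1): [1.0, 0.0],    # pushes towards the factual token
}

diffs = head_logit_diffs(head_outputs, unembed, factual_id=0, counterfactual_id=1)
targets = top_heads(diffs, k=1)
baseline = steered_logit_diff(head_outputs, unembed, 0, 1, targets, scale=1.0)
ablated = steered_logit_diff(head_outputs, unembed, 0, 1, targets, scale=0.0)
print(targets, baseline, ablated)
```

In this toy setup, a positive logit difference favors the image-supported answer, so zeroing out the strongest positive head moves the prediction back towards the commonsense answer. In a real VLM the per-head outputs would have to be collected from the model's forward pass, for example with hooks.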