When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when the image provides clear evidence to the contrary. This bias poses risks for applications that require visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that the direction in which predictions change reveals whether they are correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate absolute improvements of up to 9.4% on conflict benchmarks while largely preserving standard performance, without training or external knowledge. CALRD recovers what the model already knew but failed to preserve.
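To make the intervention described in the abstract concrete, the following is a minimal sketch of the general idea only, not the authors' CALRD implementation. It assumes that per-layer next-token logits are available (for example from a logit-lens style readout), and it treats any confident intermediate-layer prediction that disagrees with the final layer as the suppressed one. The function name `override_aware_decode`, the `threshold` parameter, and its default value are hypothetical; the abstract does not specify how conflicts are detected or which layers serve as the reference.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def override_aware_decode(layer_logits, threshold=0.8):
    """Pick a token, restoring a confident intermediate prediction if overridden.

    layer_logits: list of per-layer logit vectors; the last entry is the
        final layer. (Assumption: such per-layer readouts are available.)
    threshold: hypothetical confidence cutoff for restoring a prediction.

    Returns (token_index, restored), where restored is True when the
    final-layer choice was replaced by an intermediate-layer choice.
    """
    final_token = argmax(softmax(layer_logits[-1]))

    # Find the most confident intermediate prediction that disagrees
    # with the final layer.
    best_token, best_conf = final_token, 0.0
    for logits in layer_logits[:-1]:
        probs = softmax(logits)
        token = argmax(probs)
        if token != final_token and probs[token] > best_conf:
            best_token, best_conf = token, probs[token]

    # Restore it only when it was confident enough (assumed criterion).
    if best_conf >= threshold:
        return best_token, True
    return final_token, False


if __name__ == "__main__":
    # Toy example: a middle layer strongly prefers token 0,
    # but the final layer flips to token 1.
    toy = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
    print(override_aware_decode(toy))
```

In this toy setup the middle layer's preference for token 0 exceeds the threshold, so the sketch returns token 0 with the restored flag set. A faithful implementation would also need the conflict-awareness the method's name implies, which this sketch does not model.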