Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a variety of vision-language tasks. However, their internal reasoning often exhibits a critical inconsistency: although deeper layers may attend to the correct visual regions, final predictions are frequently misled by noisy attention from earlier layers. This results in a disconnect between what the model internally understands and what it ultimately expresses, a phenomenon we describe as "seeing it right but saying it wrong." To address this issue, we propose DualPD, a dual-perspective decoding refinement strategy that enhances visual understanding without any additional training. DualPD consists of two components. (1) The layer-wise attention-guided contrastive logits module captures how the belief in the correct answer evolves by comparing output logits between the layers that exhibit the largest attention shift. (2) The head-wise information filtering module suppresses low-contribution attention heads that focus on irrelevant regions, thereby improving attention quality within each layer. Experiments on both the LLaVA and Qwen-VL model families across multiple multimodal benchmarks demonstrate that DualPD consistently improves accuracy without any training, confirming its effectiveness and generalizability. The code will be released upon publication.
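To make the two components concrete, the following is a minimal PyTorch sketch of how a layer-wise attention-guided logit contrast and a head-wise filtering step could be realized at decoding time. It is an illustrative approximation of the described idea, not the authors' released implementation: all names (`hidden_states`, `attentions`, `lm_head`, `image_token_mask`, `alpha`, `keep_ratio`) are hypothetical placeholders, and the specific layer-selection and head-scoring rules are assumptions for illustration.

```python
# Illustrative sketch of the two DualPD-style ideas (assumed, simplified interfaces).
import torch

def layerwise_contrastive_logits(hidden_states, attentions, lm_head,
                                 image_token_mask, alpha=1.0):
    """Contrast next-token logits between the two consecutive layers whose
    attention mass on image tokens shifts the most.

    hidden_states:    list of [batch, seq, dim] per-layer outputs
    attentions:       list of [batch, heads, seq, seq] per-layer attention maps
    lm_head:          nn.Linear mapping dim -> vocab
    image_token_mask: [seq] bool tensor marking image-token positions
    """
    # Attention mass that the last (to-be-decoded) position places on image tokens, per layer.
    img_mass = torch.stack([
        a[:, :, -1, image_token_mask].sum(dim=-1).mean(dim=1)  # average over heads
        for a in attentions
    ])  # [num_layers, batch]

    # Pick the pair of adjacent layers with the largest attention shift.
    shift = (img_mass[1:] - img_mass[:-1]).abs().mean(dim=-1)  # [num_layers - 1]
    l = int(shift.argmax())
    shallow, deep = hidden_states[l], hidden_states[l + 1]

    logits_shallow = lm_head(shallow[:, -1])
    logits_deep = lm_head(deep[:, -1])
    # Amplify what the deeper layer "believes" relative to the shallower one.
    return logits_deep + alpha * (logits_deep - logits_shallow)


def filter_low_contribution_heads(attn, image_token_mask, keep_ratio=0.5):
    """Suppress heads within a layer whose attention to image tokens is weakest.

    attn: [batch, heads, seq, seq] attention weights of one layer
    """
    head_scores = attn[:, :, -1, image_token_mask].sum(dim=-1)  # [batch, heads]
    k = max(1, int(keep_ratio * attn.shape[1]))
    topk = head_scores.topk(k, dim=-1).indices
    keep = torch.zeros_like(head_scores, dtype=torch.bool).scatter_(1, topk, True)
    return attn * keep[:, :, None, None]
```

In this sketch the contrastive term rewards tokens whose probability grows across the selected layer pair, and the head filter simply zeroes the weakest heads; how the two signals are combined during generation is left open, as the abstract does not specify it.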