LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free remedies fall short: contrastive decoding and auxiliary expert models incur several times the computational overhead and may introduce interference, while static internal signal enhancement is vulnerable to the attention sink phenomenon. We find that the internal Positive Attention Dynamics (PAD) of LVLMs naturally reveal semantically core visual regions even under the distortions of attention sinks. Building on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation (MAD) scaling to adaptively control the intervention strength, and employs System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
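To make the MAD-based adaptive scaling concrete, here is a minimal sketch of how a per-head robust intervention on visual-token attention could look. This is an illustrative assumption, not the paper's implementation: the function names (`mad_scale`, `pade_intervene`), the choice to boost only positive robust z-scores, and the parameters `alpha` and `z_cap` are all hypothetical placeholders.

```python
import numpy as np

def mad_scale(attn_head, eps=1e-6):
    """Robust z-score of one head's attention over image tokens,
    using the Median Absolute Deviation (MAD) instead of std,
    so a few attention-sink tokens do not dominate the statistic."""
    med = np.median(attn_head)
    mad = np.median(np.abs(attn_head - med))
    return (attn_head - med) / (mad + eps)

def pade_intervene(attn_head, alpha=0.1, z_cap=3.0):
    """Boost tokens whose attention sits well above the head's median
    (hypothetical PAD-like signal), cap the boost, and renormalize."""
    z = mad_scale(attn_head)
    boost = 1.0 + alpha * np.clip(z, 0.0, z_cap)  # only positive deviations
    out = attn_head * boost
    return out / out.sum()  # keep a valid attention distribution
```

Because MAD is a median-based statistic, the scaling is insensitive to a single extreme sink token, which a mean/std-based z-score would not be.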