Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet these architectures remain prone to multimodal hallucinations. This vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens by textual likelihood without verifying that they have localized visual support. We show that this language-only ranking induces an objective mismatch, in which language probability mass acts as a misspecified proxy for the intended multimodal task. We therefore reinterpret hallucination as a localized optimization error, in which the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform attention distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that the objective loss incurred by VISAGE remains bounded under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
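To make the re-ranking idea concrete, the following is a minimal sketch of entropy-penalized candidate ranking, not the VISAGE implementation. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: the function names `spatial_entropy` and `rerank`, the penalty weight `lam`, the use of a plain mean over heads as the "consensus", and the tensor layout `(heads, candidates, patches)`.

```python
import numpy as np

def spatial_entropy(attn, eps=1e-12):
    """Normalized Shannon entropy over the last (image-patch) axis.

    Returns values near 1 for spatially uniform attention and near 0 for
    attention concentrated on a single patch. Assumes more than one patch.
    """
    p = attn / (attn.sum(axis=-1, keepdims=True) + eps)
    h = -(p * np.log(p + eps)).sum(axis=-1)
    return h / np.log(attn.shape[-1])

def rerank(log_probs, attn, lam=1.0):
    """Re-rank candidates by language log-likelihood minus an entropy penalty.

    log_probs: (num_candidates,) language-only log-likelihoods.
    attn:      (num_heads, num_candidates, num_patches) non-negative
               cross-attention weights (hypothetical layout).
    lam:       hypothetical penalty weight; lam=0 recovers language-only ranking.
    """
    head_entropy = spatial_entropy(attn)      # (num_heads, num_candidates)
    penalty = head_entropy.mean(axis=0)       # assumed consensus: mean over heads
    scores = log_probs - lam * penalty
    order = np.argsort(-scores)               # best candidate first
    return order, scores

# Toy example: candidate 0 is more likely under the language prior but attends
# uniformly over the image; candidate 1 is less likely but sharply localized.
log_probs = np.log(np.array([0.6, 0.4]))
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
attn = np.array([[uniform, peaked],
                 [uniform, peaked]])          # two heads that agree

order, scores = rerank(log_probs, attn, lam=1.0)
baseline_order, _ = rerank(log_probs, attn, lam=0.0)
print(order, baseline_order)
```

In this toy setting the language-only ranking (`lam=0`) prefers candidate 0, while the penalized ranking prefers the localized candidate 1. How the penalty weight is chosen, and how head-level agreement is actually measured, are left open by the abstract.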