Multimodal Large Language Models (MLLMs) frequently hallucinate due to their reliance on fragile, linear reasoning and weak visual grounding. We propose Visual Attention Reasoning (VAR), a reinforcement learning framework that reformulates reasoning as a hierarchical search with self-verification. VAR enforces traceable evidence grounding by generating explicit bounding boxes, guided by a novel reward function combining geometric precision and semantic sufficiency. Furthermore, it replaces linear Chain-of-Thought with a tree-search policy capable of backtracking to correct logical errors. Theoretical analysis validates the framework's reliability, and extensive experiments demonstrate that VAR significantly outperforms state-of-the-art methods on complex hallucination and safety benchmarks.
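The abstract does not specify the exact form of the reward; as a rough illustrative sketch only, a grounding reward combining geometric precision and semantic sufficiency could be a weighted sum of an IoU term and a semantic score, as below. All names here (`grounding_reward`, `semantic_score`, `alpha`) are hypothetical, not from the paper; the semantic term is assumed to be a model-provided similarity in [0, 1].

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_box, gt_box, semantic_score, alpha=0.5):
    """Hypothetical reward: geometric precision (IoU of the predicted
    evidence box against a reference box) blended with a semantic
    sufficiency score; `alpha` trades off the two terms."""
    return alpha * iou(pred_box, gt_box) + (1.0 - alpha) * semantic_score
```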