Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
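As a hedged illustration (an assumption for intuition, not the definition stated in this abstract), one natural object-level formalization of SAE aggregates patch-level visual attention weights $a_i$ over semantic segmentation regions $S_1,\dots,S_K$ and measures the Shannon entropy of the resulting distribution:
$$
p_k = \frac{\sum_{i \in S_k} a_i}{\sum_{j} a_j}, \qquad \mathrm{SAE} = -\sum_{k=1}^{K} p_k \log p_k .
$$
Under this reading, a high SAE indicates attention spread diffusely across many object regions rather than concentrated on the objects actually mentioned, which a reliability score could flag as a likely hallucination.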