Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications and is a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we find that imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal association with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) against seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving LVLMs' general capability across diverse vision-language tasks by up to 15.9%.
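To make the two notions of imbalance concrete, the following is a minimal illustrative sketch, not the definition or algorithm used by AIR, which this abstract does not specify. It assumes a single attention row over a mixed sequence of visual and language tokens. The helper names (`modality_share`, `normalized_entropy`, `rebalance`) and the choice of a target visual share are hypothetical: the share of attention mass on visual tokens stands in for modality-wise imbalance, normalized entropy stands in for token-wise imbalance, and a simple rescaling stands in for rectification.

```python
import math

def modality_share(attn, is_visual):
    """Fraction of total attention mass placed on visual tokens (assumed proxy
    for modality-wise imbalance; not the paper's metric)."""
    total = sum(attn)
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    return visual / total

def normalized_entropy(weights):
    """Entropy divided by log(n): 1.0 means uniform attention, values near 0
    mean attention is concentrated on few tokens (assumed token-wise proxy)."""
    if len(weights) < 2:
        return 1.0
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(weights))

def rebalance(attn, is_visual, target_visual_share):
    """Rescale one attention row so visual tokens receive the target share.
    Relative weights inside each modality are preserved and the result sums
    to 1. This is a hypothetical rectification rule for illustration only."""
    share = modality_share(attn, is_visual)
    if not 0.0 < share < 1.0:
        raise ValueError("both modalities must carry some attention mass")
    total = sum(attn)
    v_scale = target_visual_share / share
    l_scale = (1.0 - target_visual_share) / (1.0 - share)
    return [a * (v_scale if v else l_scale) / total
            for a, v in zip(attn, is_visual)]

# Three visual tokens followed by two language tokens (made-up weights).
attn = [0.05, 0.05, 0.10, 0.50, 0.30]
is_visual = [True, True, True, False, False]

before = modality_share(attn, is_visual)
balanced = rebalance(attn, is_visual, target_visual_share=0.5)
after = modality_share(balanced, is_visual)
```

In this toy row the language tokens hold most of the attention mass before rescaling; after `rebalance`, the visual tokens hold the requested share while the ordering of weights within each modality is unchanged. An actual decoding-time intervention would operate on per-head, per-layer attention inside the model, which is outside the scope of this sketch.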