Large Vision-Language Models (LVLMs) excel at diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders, identifying three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on a general LVLM benchmark, highlighting its broad applicability. Code is available at https://github.com/hukcc/SHIELD.
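The third strategy pairs an adversarial attack with contrastive decoding. A minimal sketch of the general contrastive-decoding idea follows; the function name, the weight `alpha`, and the toy logits are illustrative assumptions, not the paper's exact formulation. The intuition: tokens whose probability rises only under the adversarially perturbed image are likely hallucinations, so the perturbed logits are subtracted from the clean ones.

```python
import numpy as np

def contrastive_decode(logits_clean, logits_attacked, alpha=0.5):
    """Sketch of contrastive decoding (names and alpha are illustrative).

    Amplifies next-token logits computed on the clean image relative to
    those computed on an adversarially perturbed image, suppressing
    tokens favored mainly under the attack.
    """
    return (1 + alpha) * logits_clean - alpha * logits_attacked

# Toy next-token logits over a 4-token vocabulary.
clean = np.array([2.0, 1.0, 0.5, 0.1])
attacked = np.array([0.5, 2.5, 0.5, 0.1])  # attack inflates token 1

adjusted = contrastive_decode(clean, attacked)
print(adjusted.argmax())  # token 1's attack-inflated score is suppressed
```

In a real LVLM pipeline, `logits_clean` and `logits_attacked` would come from two forward passes of the same model, one on the original image and one on its adversarially perturbed counterpart.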