We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. Leveraging the object-centric attention of self-supervised Vision Transformers, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and yields a stronger contrast signal. Our method is prompt-agnostic and model-agnostic, and plugs seamlessly into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
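To make the pipeline concrete, below is a minimal PyTorch sketch of the two steps the abstract describes: building the auxiliary view by masking the most salient patches, and the standard VCD logit contrast. The function names (`mask_salient_patches`, `vcd_step`), the patch size, the masking ratio, and the use of a precomputed DINO-style CLS attention map are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def mask_salient_patches(image, cls_attn, patch=14, ratio=0.2, fill=0.0):
    """Build the auxiliary view by removing the most salient patches.

    image:    (C, H, W) tensor; H and W assumed divisible by `patch`.
    cls_attn: (H//patch * W//patch,) CLS-to-patch attention scores from a
              self-supervised ViT (e.g., DINO), assumed precomputed once
              per image so the extra forward pass is cacheable.
    """
    C, H, W = image.shape
    gh, gw = H // patch, W // patch
    k = max(1, int(ratio * gh * gw))
    top = cls_attn.topk(k).indices          # indices of most-attended patches
    out = image.clone()
    for idx in top.tolist():                # erase each salient patch
        r, c = divmod(idx, gw)
        out[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = fill
    return out

def vcd_step(logits_orig, logits_aux, alpha=1.0):
    """Standard VCD contrast: amplify tokens supported by the original view
    and penalize those that survive without the salient evidence."""
    return (1 + alpha) * logits_orig - alpha * logits_aux
```

A toy usage, with random tensors standing in for the real image, attention map, and MLLM logits:

```python
img = torch.rand(3, 224, 224)
attn = torch.rand(16 * 16)                    # stand-in for DINO CLS attention
aux = mask_salient_patches(img, attn)         # object-aligned auxiliary view

logits_v   = torch.randn(32000)               # MLLM logits given the original image
logits_aux = torch.randn(32000)               # MLLM logits given the masked view
next_token_logits = vcd_step(logits_v, logits_aux, alpha=1.0)
```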