Recently, Large Vision-Language Models (LVLMs) have shown remarkable performance across various domains. However, these models suffer from object hallucination. In this work, we study object hallucination primarily in a discriminative, retrieval-style evaluation setting (OHD-Caps), rather than in free-form caption generation. We revisit the previous claim that the cause of such hallucinations lies in the limited representational capacity of the vision encoder. Our analysis suggests that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLMs (a 4.9% gain in POPE accuracy after alignment pretraining). Our code is publicly available at https://github.com/abzb1/f-clip
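To make the idea of noun-level granularity concrete, the following is a minimal sketch of how a metric like F-CLIPScore could aggregate caption-level and noun-level similarities. This is an illustrative assumption, not the paper's exact formulation: the standard CLIPScore form `w * max(cos(image, text), 0)` is real, but the averaging over per-noun text embeddings shown here is a hypothetical reading of "incorporating text embeddings at the noun level"; embeddings are plain vectors so the sketch stays self-contained.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def clip_score(img_emb, txt_emb, w=2.5):
    """Standard CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(_cosine(img_emb, txt_emb), 0.0)

def f_clip_score(img_emb, caption_emb, noun_embs, w=2.5):
    """Hypothetical F-CLIPScore sketch: average the caption-level
    CLIPScore with per-noun CLIPScores, so a single hallucinated
    object (a noun poorly aligned with the image) drags the score
    down even when the full caption embedding still matches well."""
    scores = [clip_score(img_emb, caption_emb, w)]
    scores += [clip_score(img_emb, n_emb, w) for n_emb in noun_embs]
    return sum(scores) / len(scores)
```

In this sketch, a caption whose nouns all align with the image keeps a score close to the plain CLIPScore, while a hallucinated noun with near-zero image similarity lowers the average, which is the object-level sensitivity the abstract describes.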