Recently, Large Vision-Language Models (LVLMs) have shown remarkable performance across various domains. However, these models suffer from object hallucination. This study revisits the previous claim that the primary cause of such hallucination lies in the limited representational capacity of the vision encoder. Our analysis instead reveals that the capacity of the vision encoder itself is already sufficient for detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun-phrase level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further validate F-CLIPScore by showing that an LVLM trained on data filtered with F-CLIPScore exhibits reduced hallucination.
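For concreteness, the following is a minimal sketch of how a noun-phrase-level CLIPScore of this kind could be computed. The aggregation shown here (a plain mean over the full caption and each of its noun phrases) and the library choices (spaCy for noun-phrase extraction, Hugging Face CLIP for embeddings) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a noun-phrase-level CLIPScore. The mean aggregation and
# all library choices below are assumptions; the paper's exact F-CLIPScore
# formulation may differ.
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def fine_grained_clipscore(image: Image.Image, caption: str) -> float:
    # Score the full caption together with each noun phrase, so that a
    # single hallucinated object drags the aggregate score down.
    texts = [caption] + [np.text for np in nlp(caption).noun_chunks]
    inputs = processor(text=texts, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity between the image and each text segment.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (txt_emb @ img_emb.T).squeeze(-1)
    # CLIPScore-style rescaling (Hessel et al., 2021, use w = 2.5).
    return 2.5 * torch.clamp(sims.mean(), min=0).item()
```

Because each noun phrase is scored against the image independently, a caption mentioning an object absent from the image receives a low similarity for that phrase, lowering the aggregate score even when the overall sentence embedding still matches the image well.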