Current large vision-language models (LVLMs) have made remarkable progress, yet it remains uncertain how accurately they capture visual details, that is, how well they perform detailed captioning. To address this, we introduce \textit{CCEval}, a GPT-4 assisted evaluation method tailored for detailed captioning. Interestingly, while LVLMs exhibit minimal object existence hallucination on existing VQA benchmarks, our proposed evaluation reveals that they remain susceptible to such hallucinations. In this paper, we make the first attempt to investigate such hallucinations and attribute them to factors including image resolution, language decoder size, and the amount, quality, and granularity of instruction data. Our findings point to unwarranted inference: when the language description includes details at a finer object granularity than the vision module can ground or verify, hallucination is induced. To control such hallucinations, we further attribute the reliability of captioning to contextual knowledge (involving only contextually grounded objects) and parametric knowledge (containing objects inferred by the model). Thus, we introduce $\textit{HallE-Switch}$, an LVLM that is controllable in terms of $\textbf{Hall}$ucination in object $\textbf{E}$xistence. HallE-Switch can condition the captioning to shift between (i) exclusively depicting contextual knowledge for grounded objects and (ii) blending it with parametric knowledge to imagine inferred objects. Our method reduces hallucination by 44% compared to LLaVA$_{7B}$ while maintaining the same object coverage.