Object hallucination has been an Achilles' heel hindering the broader application of large vision-language models (LVLMs). Object hallucination refers to the phenomenon in which an LVLM claims that non-existent objects appear in the image. To mitigate object hallucinations, instruction tuning and external model-based detection methods have been proposed, which either require large-scale computational resources or depend on the detection results of external models. However, using the LVLM itself to alleviate object hallucinations remains under-explored. In this work, we adopt the intuition that an LVLM tends to respond logically consistently for existent objects but inconsistently for hallucinated objects. We therefore propose a Logical Closed Loop-based framework for Object Hallucination Detection and Mitigation, namely LogicCheckGPT. Specifically, we devise logical consistency probing, which raises logically correlated questions: inquiring about the attributes of an object, and conversely, inquiring which object possesses those attributes. Whether the responses form a logical closed loop serves as an indicator of object hallucination. As a plug-and-play method, it can be seamlessly applied to all existing LVLMs. Comprehensive experiments on three benchmarks across four LVLMs demonstrate significant improvements brought by our method, indicating its effectiveness and generality.
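To make the probing idea concrete, the following Python sketch illustrates one possible reading of logical closed-loop probing; it is not the authors' implementation. In each round it asks the model a forward question (object to attribute) and a backward question (attribute to object), and treats the round as "closed" if the original object reappears in the backward answer. The `ask_lvlm` callable, the prompt wording, and the decision threshold are all hypothetical placeholders.

```python
from typing import Callable

def probe_object(
    ask_lvlm: Callable[[str], str],  # hypothetical: sends one question about the image, returns the answer text
    obj: str,
    num_rounds: int = 3,
    threshold: float = 0.5,
) -> bool:
    """Return True if `obj` is judged hallucinated via logical closed-loop probing."""
    closed = 0
    for _ in range(num_rounds):
        # Forward question: object -> attribute.
        attribute = ask_lvlm(
            f"What is one visual attribute of the {obj} in the image? Answer briefly."
        )
        # Backward question: attribute -> object.
        answer = ask_lvlm(
            f"Which object in the image is {attribute}? Answer with the object name."
        )
        # The loop "closes" if the backward answer returns to the original object.
        if obj.lower() in answer.lower():
            closed += 1
    # Mostly closed loops indicate consistent responses (a real object);
    # mostly broken loops suggest the object is hallucinated.
    return closed / num_rounds < threshold
```

In practice, such a detector could be wrapped around any LVLM's question-answering interface and paired with a rewriting step that removes flagged objects from the model's description, matching the plug-and-play framing above.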