Large Visual Language Models (LVLMs) struggle with hallucinations in visual instruction following tasks, limiting their trustworthiness and real-world applicability. We propose Pelican, a novel framework designed to detect and mitigate hallucinations through claim verification. Pelican first decomposes the visual claim into a chain of sub-claims based on first-order predicates. These sub-claims consist of (predicate, question) pairs and can be conceptualized as nodes of a computational graph. We then use Program-of-Thought prompting to generate Python code that answers these questions through flexible composition of external tools. Pelican improves over prior work by introducing (1) intermediate variables for precise grounding of object instances, and (2) shared computation for answering the sub-questions, which enables adaptive corrections and inconsistency identification. Finally, we use the reasoning abilities of an LLM to verify the correctness of the claim by considering the consistency and confidence of the (question, answer) pairs from each sub-claim. Our experiments reveal a drop in hallucination rate of $\sim$8%-32% across various baseline LVLMs, and a 27% drop relative to existing hallucination-mitigation approaches on MMHal-Bench. Results on two other benchmarks further corroborate these findings.
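To make the pipeline concrete, the following is a minimal sketch of a sub-claim chain with intermediate variables and shared computation. It is not Pelican's implementation: the `detect` tool, the `FAKE_IMAGE` data, the `$dog`/`$ball` variable names, and the example claim are all hypothetical, and the final LLM-based verification step is replaced here by a simple all-answers-are-"yes" rule purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an external detection tool. A real system
# would call an object detector or VQA model on the image instead.
FAKE_IMAGE = {
    "dog": [{"id": "dog_0", "color": "brown"}],
    "ball": [{"id": "ball_0", "color": "red"}],
}


def detect(class_name: str) -> List[dict]:
    """Return the (fake) detected instances of a class."""
    return FAKE_IMAGE.get(class_name, [])


@dataclass
class SubClaim:
    predicate: str   # first-order predicate, e.g. "Color($dog, brown)"
    question: str    # natural-language question for this node
    # Program that reads/writes the shared environment of intermediate
    # variables and returns (answer, confidence).
    program: Callable[[Dict[str, List[dict]]], Tuple[str, float]]


def exists(var: str, class_name: str):
    def program(env):
        env[var] = detect(class_name)  # ground instances once, reuse later
        return ("yes" if env[var] else "no", 1.0)
    return program


def has_color(var: str, color: str):
    def program(env):
        instances = env.get(var, [])   # reuse the grounded instances
        if not instances:
            return ("unknown", 0.0)
        match = any(inst["color"] == color for inst in instances)
        return ("yes" if match else "no", 1.0)
    return program


def run_chain(chain: List[SubClaim]):
    env: Dict[str, List[dict]] = {}    # shared across all sub-claims
    return [(sc.question, *sc.program(env)) for sc in chain]


def verify(qa_pairs) -> bool:
    # Simplified rule standing in for the LLM-based verification step.
    return all(answer == "yes" for _, answer, _ in qa_pairs)


# Hypothetical claim: "A brown dog is playing with a blue ball."
chain = [
    SubClaim("Exists($dog)", "Is there a dog?", exists("$dog", "dog")),
    SubClaim("Color($dog, brown)", "Is the dog brown?", has_color("$dog", "brown")),
    SubClaim("Exists($ball)", "Is there a ball?", exists("$ball", "ball")),
    SubClaim("Color($ball, blue)", "Is the ball blue?", has_color("$ball", "blue")),
]

qa_pairs = run_chain(chain)
verdict = verify(qa_pairs)
```

In this toy setup the last sub-claim fails because the grounded ball instance is red, so the claim as a whole is rejected. Because each color question reads the instances stored by the earlier existence check, the sub-questions are answered about the same grounded objects rather than re-detected independently.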