While Large Vision-Language Models (LVLMs) achieve strong performance on multimodal tasks, hallucinations continue to hinder their reliability. Of the three categories of hallucination (object, attribute, and relation), relation hallucinations account for the largest proportion but have received the least attention. To address this issue, we propose ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are fed to the model sequentially, with textual and visual memories from earlier steps providing supporting context for later ones, thereby forming an interleaved chain of images and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.
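The sequential questioning described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the question templates, the function names `build_questions` and `run_chain`, and the `AnswerFn` interface are all hypothetical, and only the textual memory is modeled, because the abstract does not say how the visual memory or the image-region enhancement is represented. The subject and object keywords are assumed to have been extracted already.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface (an assumption, not an API from the paper):
# takes a question plus the textual memory gathered so far, returns an answer.
# A real LVLM call would also receive the image and any visual memory.
Memory = List[Tuple[str, str]]
AnswerFn = Callable[[str, Memory], str]


def build_questions(subject: str, obj: str, original: str) -> List[str]:
    """Build illustrative questions about the subject, the object, and the
    relation linking them, ending with the original question.

    The templates are placeholders; the abstract does not give the wording.
    """
    return [
        f"Describe the {subject} in the image.",
        f"Describe the {obj} in the image.",
        f"How are the {subject} and the {obj} related?",
        original,
    ]


def run_chain(questions: List[str], answer_fn: AnswerFn) -> Memory:
    """Ask the questions in order, passing earlier (question, answer) pairs
    to each later step as supporting context."""
    memory: Memory = []
    for question in questions:
        # Pass a copy so the model function cannot mutate the chain's state.
        answer = answer_fn(question, list(memory))
        memory.append((question, answer))
    return memory
```

In this sketch the last entry of the returned memory holds the answer to the original question, produced with every earlier answer available as context. Swapping in a real model would mean wrapping its inference call in a function that matches `AnswerFn`.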