Despite great progress, existing multimodal large language models (MLLMs) are prone to visual hallucination, which greatly impedes their trustworthy applications. In this paper, we study this problem from the perspective of visual-spatial reasoning and propose a new learning task for MLLMs, termed Grounded Chain-of-Thought (GCoT). Different from recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we carefully design and construct a dataset called Multimodal Grounded Chain-of-Thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images. In addition, a comprehensive consistency evaluation system is introduced, including the metrics of answer accuracy, grounding accuracy, and answer-grounding consistency. We further design and conduct extensive experiments on 12 advanced MLLMs and reveal several notable findings: i. most MLLMs perform poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to parameter size or general multimodal performance, i.e., a larger and stronger MLLM is not necessarily less affected by this issue. Lastly, we demonstrate that the proposed dataset can help existing MLLMs effectively cultivate their GCoT capability and significantly reduce inconsistent answering. Moreover, the acquired GCoT capability can also be generalized to existing multimodal tasks, such as open-world QA and referring expression comprehension (REC).
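To make the consistency evaluation concrete, the sketch below shows one plausible way the three metrics could be computed. It is a minimal illustration, not the paper's actual protocol: the `Prediction` structure, the IoU threshold of 0.5, and the definition of consistency as agreement between answer correctness and grounding correctness are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the consistency metrics; the paper's exact
# definitions (especially of answer-grounding consistency) may differ.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Prediction:
    answer_correct: bool  # whether the predicted answer matches the ground truth
    pred_box: Box         # predicted grounding coordinates
    gt_box: Box           # ground-truth grounding coordinates

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds: List[Prediction], iou_thr: float = 0.5) -> dict:
    """Compute answer accuracy, grounding accuracy, and answer-grounding consistency."""
    n = len(preds)
    ans_acc = sum(p.answer_correct for p in preds) / n
    grd_hits = [iou(p.pred_box, p.gt_box) >= iou_thr for p in preds]
    grd_acc = sum(grd_hits) / n
    # Consistency (assumed definition): the answer and the grounding
    # are either both correct or both incorrect for a given example.
    consistency = sum(p.answer_correct == g for p, g in zip(preds, grd_hits)) / n
    return {"answer_acc": ans_acc, "grounding_acc": grd_acc, "consistency": consistency}
```

Under this assumed definition, a low consistency score flags cases where a model answers correctly while grounding the wrong region (or vice versa), which is the kind of answer-grounding mismatch the evaluation is meant to expose.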