Despite great progress, existing multimodal large language models (MLLMs) are prone to visual hallucination, which greatly impedes their trustworthy application. In this paper, we study this problem from the perspective of visual-spatial reasoning and propose a new learning task for MLLMs, termed Grounded Chain-of-Thought (GCoT). Unlike recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as an intuitive basis. To facilitate this task, we carefully design and construct a dataset called Multimodal Grounded Chain-of-Thought (MM-GCoT), consisting of 24,022 GCoT examples over 5,033 images. In addition, we introduce a comprehensive consistency evaluation system, including the metrics of answer accuracy, grounding accuracy, and answer-grounding consistency. We further design and conduct a series of experiments on 12 advanced MLLMs and reveal some notable findings: i. most MLLMs perform poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to parameter size or general multimodal performance, i.e., a larger and stronger MLLM is not less affected by this issue. Lastly, we demonstrate that the proposed dataset can help existing MLLMs cultivate their GCoT capability and significantly reduce inconsistent answers. Moreover, their GCoT capability can also be generalized to existing multimodal tasks, such as open-world QA and REC.
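The three metrics of the consistency evaluation system could be computed roughly as follows. This is a minimal sketch, not the paper's actual protocol: the IoU threshold, box format, and the definition of consistency (answer is correct if and only if grounding is correct) are all assumptions for illustration.

```python
# Hypothetical sketch of an answer/grounding consistency evaluation.
# Metric definitions here are illustrative assumptions, not the paper's formulas.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(samples, iou_thresh=0.5):
    """samples: list of dicts with keys
    'pred_answer', 'gt_answer', 'pred_box', 'gt_box' (hypothetical schema)."""
    ans_correct = [s["pred_answer"] == s["gt_answer"] for s in samples]
    grd_correct = [iou(s["pred_box"], s["gt_box"]) >= iou_thresh for s in samples]
    # A sample is "consistent" when the answer and its grounding agree:
    # both correct, or both wrong.
    consistent = [a == g for a, g in zip(ans_correct, grd_correct)]
    n = len(samples)
    return {
        "answer_acc": sum(ans_correct) / n,
        "grounding_acc": sum(grd_correct) / n,
        "consistency": sum(consistent) / n,
    }
```

Under this sketch, a model that answers correctly while grounding the wrong region (or vice versa) lowers the consistency score even when one of the two individual accuracies is high, which is the behavior the abstract attributes to visual hallucination.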