With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision-making, caused by insufficient visual information and by low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper explores multimodal CoT for solving intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose a multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to act as multifaceted experts that derive higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project Page: https://ggg0919.github.io/cantor/
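The two-stage flow described in the abstract (a decision step that sees the image and the question, followed by the same MLLM acting as several experts) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the Cantor implementation: the `MLLM` callable interface, the `SubTask` type, the prompt wording, the `expert: instruction` reply format, and the expert role names. The stub model only exists so the sketch runs without a real MLLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: any callable that maps (prompt, image) to text.
MLLM = Callable[[str, Optional[bytes]], str]


@dataclass
class SubTask:
    expert: str       # hypothetical expert role name
    instruction: str  # what that expert is asked to extract


def generate_decision(mllm: MLLM, question: str, image: Optional[bytes],
                      experts: List[str]) -> List[SubTask]:
    """Decision stage: the model sees the image and question and assigns sub-tasks."""
    prompt = (
        "Analyze the image and the question, then assign sub-tasks.\n"
        f"Question: {question}\n"
        f"Available experts: {', '.join(experts)}\n"
        "Answer with one line per sub-task in the form 'expert: instruction'."
    )
    tasks = []
    for line in mllm(prompt, image).splitlines():
        name, sep, instruction = line.partition(":")
        # Keep only well-formed lines that name a known expert.
        if sep and name.strip() in experts:
            tasks.append(SubTask(name.strip(), instruction.strip()))
    return tasks


def execute(mllm: MLLM, question: str, image: Optional[bytes],
            tasks: List[SubTask]) -> str:
    """Execution stage: the same model plays each expert, then synthesizes an answer."""
    findings = []
    for task in tasks:
        role_prompt = f"You are the {task.expert}. {task.instruction}"
        findings.append(f"{task.expert}: {mllm(role_prompt, image)}")
    final_prompt = (
        f"Question: {question}\n"
        "Expert findings:\n" + "\n".join(findings) + "\nGive the final answer."
    )
    return mllm(final_prompt, image)


def stub_mllm(prompt: str, image: Optional[bytes]) -> str:
    """Deterministic stand-in for a real MLLM, for demonstration only."""
    if prompt.startswith("Analyze"):
        return "counter: count the apples\nreader: read the label\nunknown: ignored"
    if prompt.startswith("You are the counter"):
        return "3 apples"
    if prompt.startswith("You are the reader"):
        return "label says 'fresh'"
    return "FINAL|" + prompt
```

With a real model, `stub_mllm` would be replaced by a wrapper around the chosen MLLM; the point of the sketch is only that the decision step receives the visual input directly and that the expert outputs are gathered before the final answer is requested.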