In this paper, we address the challenging task of multimodal reasoning by incorporating the notion of ``slow thinking'' into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively apply different depths of reasoning to questions of varying complexity. We propose a novel paradigm, Self-structured Chain of Thought (SCoT), which composes reasoning from minimal semantic atomic steps. Unlike existing methods that rely on structured templates or free-form paradigms, our method not only generates flexible CoT structures for complex tasks but also mitigates overthinking on easier ones. To introduce structured reasoning into visual cognition, we design the AtomThink framework with four key modules: (i) a data engine that generates high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process over serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric that evaluates single-step utilization. Extensive experiments demonstrate that AtomThink significantly improves the performance of baseline MLLMs, achieving average accuracy gains of more than 10\% on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5$\times$ and boosts inference efficiency by 85.3\%. Our code is publicly available at \url{https://github.com/Kun-Xiang/AtomThink}.