In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of ``slow thinking'' into multimodal large language models (MLLMs). Unlike existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoning. To this end, we design a novel AtomThink framework composed of three key modules: (i) a CoT annotation engine that automatically generates CoT annotations to address the scarcity of high-quality visual mathematical data; (ii) an atomic step fine-tuning strategy that jointly optimizes an MLLM and a policy reward model (PRM) for step-wise reasoning; and (iii) four different search strategies that can be applied with the PRM to complete the reasoning process. Additionally, we propose AtomMATH, a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks. Extensive experimental results show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving relative accuracy gains of approximately 50\% on MathVista and 120\% on MathVerse. To support the advancement of multimodal slow-thinking models, we will make our code and dataset publicly available at https://github.com/Quinn777/AtomThink.
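To make the PRM-guided step-wise search concrete, below is a minimal sketch of one such strategy: greedy selection among sampled candidate atomic steps, scored by a reward model. This is an illustration under assumed interfaces, not the authors' implementation; `Policy`, `RewardModel`, `propose_steps`, and `score` are hypothetical stand-ins for the MLLM and PRM, and greedy selection stands in for one of the four strategies, which the abstract does not enumerate.

```python
# Minimal sketch (not the AtomThink codebase) of PRM-guided step-wise search.
# `Policy` and `RewardModel` are hypothetical interfaces: the MLLM proposes
# candidate next atomic steps, and the PRM scores partial chains of thought.

from typing import List, Protocol


class Policy(Protocol):
    def propose_steps(self, question: str, chain: List[str], n: int) -> List[str]:
        """Sample n candidate next atomic steps given the partial chain."""
        ...


class RewardModel(Protocol):
    def score(self, question: str, chain: List[str]) -> float:
        """Return a scalar quality score for a partial chain of thought."""
        ...


def greedy_prm_search(question: str, policy: Policy, prm: RewardModel,
                      n_candidates: int = 4, max_steps: int = 10) -> List[str]:
    """Build a chain of thought one atomic step at a time, keeping the
    candidate step the PRM scores highest at each iteration."""
    chain: List[str] = []
    for _ in range(max_steps):
        candidates = policy.propose_steps(question, chain, n_candidates)
        if not candidates:
            break
        # Greedily keep the candidate whose extended chain scores best.
        best = max(candidates, key=lambda s: prm.score(question, chain + [s]))
        chain.append(best)
        # Stop once the policy emits a terminal answer step (assumed convention).
        if best.strip().lower().startswith("final answer"):
            break
    return chain
```

Other strategies in the same family (e.g., best-of-N sampling or beam search over partial chains) differ only in how many partial chains are kept alive between steps; the greedy variant above keeps exactly one.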