In line with test-time scaling, combining external slow-thinking with a verification mechanism has been shown to enhance multi-round reasoning in large language models (LLMs). However, the multimodal (MM) domain still lacks a strong MM verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (CoT) data; this data is then used to fine-tune the verification model, MM-Verifier. Second, we present a more efficient method for synthesizing MMCoT data, bridging the gap between text-based and multimodal reasoning; the synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as the data size increases. Finally, combining MM-Reasoner and MM-Verifier achieves an accuracy of 65.3 on MathVista with 12 rollouts, surpassing GPT-4o (63.8).
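The final result combines the reasoner's rollouts with the verifier's scores. The general pattern this describes is verifier-guided best-of-N selection: sample N candidate reasoning chains, score each with the verifier, and keep the highest-scoring one. The sketch below illustrates that pattern only; the function names (`best_of_n`, the toy reasoner/verifier) are hypothetical placeholders, not the paper's actual API or models.

```python
import random

def best_of_n(question, reasoner, verifier, n_rollouts=12):
    """Verifier-guided best-of-N: sample n_rollouts candidate reasoning
    chains from the reasoner, return the one the verifier scores highest."""
    candidates = [reasoner(question) for _ in range(n_rollouts)]
    scored = [(verifier(question, chain), chain) for chain in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Toy stand-ins so the sketch runs end-to-end; a real setup would call
# MM-Reasoner for generation and MM-Verifier for scoring.
def toy_reasoner(question):
    return f"chain ending in answer-{random.randint(0, 3)}"

def toy_verifier(question, chain):
    # Pretend the verifier assigns high confidence to one particular answer.
    return 1.0 if chain.endswith("answer-2") else 0.0

random.seed(0)
print(best_of_n("What is 2 + 2?", toy_reasoner, toy_verifier))
```

With a stronger verifier, accuracy improves as the number of rollouts grows, which is the scaling behavior the abstract reports for the 12-rollout setting.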