DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be leveraged to enhance the reasoning capability of Multimodal Large Language Models (MLLMs). However, direct RL training struggles to activate complex reasoning behaviors such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose Vision-R1, a reasoning MLLM with improved multimodal reasoning capability. Specifically, we first construct a high-quality multimodal Chain-of-Thought (CoT) dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering, yielding the 200K Vision-R1-cold dataset, which serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after the cold start, we propose a Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with a hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show that our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark, only 0.4% below the leading reasoning model, OpenAI O1. The datasets and code will be released at: https://github.com/Osilly/Vision-R1 .
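The GRPO objective with a hard formatting result reward can be illustrated with a minimal sketch. The exact reward specification is an assumption on our part: we assume a strict `<think>...</think><answer>...</answer>` output template in which reward is granted only when both the format and the final answer are correct, and we show the group-relative advantage normalization that gives GRPO its name. Function names (`hard_format_result_reward`, `grpo_advantages`) are illustrative, not from the paper.

```python
import re
import statistics

def hard_format_result_reward(completion: str, ground_truth: str) -> float:
    """Hedged sketch of a hard formatting result reward: 1.0 only if the
    completion strictly follows the assumed <think>...</think><answer>...</answer>
    template AND the extracted answer matches the ground truth; else 0.0."""
    m = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*",
        completion,
        flags=re.DOTALL,
    )
    if m is None:
        return 0.0  # format violated: no partial credit under a hard reward
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative advantage: each of the G sampled completions for
    the same prompt is scored against its group, A_i = (r_i - mean) / std,
    so no learned value/critic network is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]
```

In use, G completions are sampled per question, scored with the hard reward, and the normalized advantages weight the policy-gradient update, so only completions that are both well-formatted and correct are reinforced relative to their group.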