Inspired by the remarkable reasoning capabilities of DeepSeek-R1 on complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, these approaches still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold-start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can yield performance surpassing many recent multimodal reasoning models, even before any multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, which achieves a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and AIME2024/AIME2025.