We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning. While rule-based RL has shown remarkable success in improving LLMs' reasoning abilities in text domains, its application to multimodal settings has remained challenging. Our work reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, including steady increases in accuracy reward and response length, and the emergence of reflection behaviors. We demonstrate that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. To foster further research in this area, we open-source our complete pipeline, releasing all code, models, and data at https://github.com/ModalMinds/MM-EUREKA
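To make the notion of a rule-based accuracy reward concrete, the sketch below shows one common way such a reward is computed: the verifier extracts the model's final answer from a `\boxed{...}` span and compares it to the ground truth, with an optional format reward for following a `<think>...</think>` reasoning template. This is an illustrative assumption, not the paper's exact implementation; the function names and matching rules are hypothetical.

```python
import re


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Rule-based accuracy reward (hypothetical sketch).

    Returns 1.0 if the answer inside the last \\boxed{...} span matches
    the ground truth after whitespace stripping, else 0.0. Real systems
    typically add numeric/symbolic equivalence checks on top of this.
    """
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0


def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0
```

Because the reward is a deterministic rule rather than a learned model, it cannot be exploited by reward hacking in the way a neural reward model can, which is one reason rule-based RL scales well for verifiable reasoning tasks.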