Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focused on purely text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single, task-specific reasoning pattern, which limits their generalizability across multimodal tasks. In practice, many multimodal tasks require diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring a perception alignment loss and a perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and that Omni-R1-Zero matches or even surpasses Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
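To make the two-stage objective concrete, the sketch below illustrates one plausible way the stage-1 and stage-2 terms described above could be combined. It is a minimal, hypothetical sketch, not the authors' implementation: the function names, the weighting hyperparameters (lambda_align, lambda_percep), and the additive form of the combination are all assumptions.

```python
# Minimal illustrative sketch of the two-stage SFT+RL objective described above.
# All names and weights are hypothetical assumptions, not taken from the paper.

def sft_loss(lm_loss: float, perception_alignment_loss: float,
             lambda_align: float = 1.0) -> float:
    """Stage 1 (SFT): language-modeling loss on interleaved text/image outputs
    plus a perception alignment term tying generated intermediate images to
    the visual evidence they are meant to depict."""
    return lm_loss + lambda_align * perception_alignment_loss


def rl_reward(answer_reward: float, perception_reward: float,
              lambda_percep: float = 0.5) -> float:
    """Stage 2 (RL): task/answer reward shaped by a perception reward that
    scores whether the generated intermediate image is functionally useful
    (e.g., a correct zoom-in or object marking)."""
    return answer_reward + lambda_percep * perception_reward


if __name__ == "__main__":
    # Toy numbers purely for illustration.
    print("stage-1 loss:", sft_loss(lm_loss=2.3, perception_alignment_loss=0.8))
    print("stage-2 reward:", rl_reward(answer_reward=1.0, perception_reward=0.6))
```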