RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with only a minimal drop in performance. Despite this improved efficiency, we find that joint reward assignment, which applies a single shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.
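
To make the distinction between joint and factorized reward assignment concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes the standard GRPO group-relative advantage (rewards standardized within a group of rollouts) and assumes that each rollout carries a boolean mask marking which tokens are image tokens. The function names, the per-modality reward inputs, and the choice to normalize each modality separately are all hypothetical and introduced only for illustration.

```python
import numpy as np


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one rollout group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def joint_token_advantages(shared_rewards, is_image):
    """Joint assignment (assumed form): every token of rollout i,
    text or image, receives the same group-relative advantage."""
    adv = group_advantages(shared_rewards)  # shape (G,)
    return np.broadcast_to(adv[:, None], is_image.shape).copy()


def factorized_token_advantages(text_rewards, image_rewards, is_image):
    """Factorized assignment (assumed form): text tokens use an advantage
    computed from text rewards only, image tokens from image rewards only."""
    text_adv = group_advantages(text_rewards)    # shape (G,)
    image_adv = group_advantages(image_rewards)  # shape (G,)
    # is_image has shape (G, T); the per-rollout advantages broadcast over T.
    return np.where(is_image, image_adv[:, None], text_adv[:, None])


# Two rollouts, four tokens each: two text tokens followed by two image tokens.
is_image = np.array([[False, False, True, True],
                     [False, False, True, True]])
text_rewards = [1.0, 0.0]   # rollout 0 has the better text segment
image_rewards = [0.0, 1.0]  # rollout 1 has the better image segment

factorized = factorized_token_advantages(text_rewards, image_rewards, is_image)
joint = joint_token_advantages(np.add(text_rewards, image_rewards), is_image)
```

In this toy group, summing the two rewards into one shared signal gives both rollouts the same total, so the joint advantages vanish for every token, whereas the factorized variant still credits the good text segment of one rollout and the good image segment of the other. This is only one way a shared reward can blur per-modality credit; the interference reported above may take other forms.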