Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs, and recent works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised fine-tuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reasoning-intensive grounding, and image editing.