Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose enabling the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning equips the MLLM with the foundational ability to generate a genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.