Current research on multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify a likely primary cause: a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework, an algorithm that reframes the single-step generation task as a multi-step "generate-understand-regenerate" process. By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improving the understanding abilities that are related to the generation process. These findings offer insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
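The "generate-understand-regenerate" process can be pictured as a simple control loop. The sketch below is only an illustration of that loop shape and is not taken from the R3 repository: the function name `r3_loop`, the `generate` and `reflect` callables, the `max_rounds` cap, and the toy stand-ins are all hypothetical, and in a real system both callables would presumably be served by the same unified multimodal model.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the authors' API):
#   generate(prompt, feedback) -> output
#   reflect(prompt, output)    -> (is_satisfactory, feedback)
Generate = Callable[[str, str], str]
Reflect = Callable[[str, str], Tuple[bool, str]]


def r3_loop(prompt: str, generate: Generate, reflect: Reflect,
            max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Generate, check the result with the understanding step, and regenerate."""
    history: List[str] = []
    feedback = ""  # no feedback is available before the first draft
    for _ in range(max_rounds):
        output = generate(prompt, feedback)      # generate (or regenerate)
        history.append(output)
        ok, feedback = reflect(prompt, output)   # understand: critique the draft
        if ok:
            break                                # draft matches the prompt; stop
    return history[-1], history


# Toy stand-ins so the loop can run without any model.
def toy_generate(prompt: str, feedback: str) -> str:
    # The first draft drops the last word; a refinement appends the fed-back words.
    draft = prompt.split()[:-1]
    if feedback:
        draft = draft + feedback.split()
    return " ".join(draft)


def toy_reflect(prompt: str, output: str) -> Tuple[bool, str]:
    # Reports which prompt words are missing from the output.
    missing = [w for w in prompt.split() if w not in output.split()]
    return (not missing, " ".join(missing))


final, history = r3_loop("a red cube", toy_generate, toy_reflect)
print(final)
print(history)
```

In this toy run the first draft omits a word, the reflection step names it, and the second pass restores it. The actual prompts, stopping criteria, and training procedure are defined in the paper and its code release.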