Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and they typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes the two tasks through a dual reasoning paradigm. We formulate generation as world-knowledge-enhanced planning that injects implicit constraints, and we leverage editing for fine-grained visual refinement that corrects visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. To support the framework, we systematically construct a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense and physics) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves strong performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench, while maintaining superior general synthesis capabilities.
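To make the dual reasoning paradigm concrete, the following is a minimal sketch of the plan-then-refine loop, assuming a hypothetical unified `model` interface; the abstract names no concrete API, so every method here (`plan_with_world_knowledge`, `generate_image`, `reflect`, `edit_image`) is an illustrative placeholder rather than the paper's implementation.

```python
def unified_reasoning_synthesis(prompt: str, model, max_refinements: int = 3):
    """Hypothetical sketch of the plan-then-refine loop described above."""
    # Stage 1: generation as world-knowledge-enhanced planning -- expand the
    # prompt with implicit constraints (e.g., cultural commonsense, physics).
    plan = model.plan_with_world_knowledge(prompt)

    # Stage 2: produce an initial image conditioned on the enriched plan.
    image = model.generate_image(plan)

    # Stage 3: self-reflection -- critique the output against the prompt and
    # plan, then correct visual errors with fine-grained edits, reusing the
    # editing capability that shares a representation with generation.
    for _ in range(max_refinements):
        critique = model.reflect(prompt, plan, image)
        if not critique.has_errors:
            break
        image = model.edit_image(image, critique.edit_instruction)

    return image
```

The point of the sketch is the control flow: planning and editing are steps of one reasoning process over a shared representation, not two separate model capabilities.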