Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.