Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, since logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that translates user prompts into well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we adopt a Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a ``first interleaving, then disentangling'' training paradigm. Specifically, we first perform Interleaved Concept Learning, which uses interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.