Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design preserves the MLLM's original text generation capabilities, enables accurate interpretation of complex multimodal instructions, and maintains visual consistency in the generated content. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as changing the environment or altering materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we release our model and code.