We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions that interpret user instructions. In this way, the rich contextual representations from the understanding model directly guide the generative process, improving performance on complex and compositional editing. Moreover, we develop a lightweight adapter that injects multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale Omni-Video 2 up to a 14B video diffusion model trained on meticulously curated, high-quality data, supporting high-quality text-to-video generation and a variety of video editing tasks such as object removal, object addition, background change, and complex motion editing. We evaluate Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or better quality in text-to-video generation.
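To make the adapter design concrete, the following is a minimal PyTorch sketch of how a lightweight adapter could map MLLM hidden states (multimodal conditional tokens) into the conditioning space of a frozen text-to-video diffusion backbone. The class name, dimensions, and the learned-query pooling scheme are illustrative assumptions on our part, not the paper's specified architecture.

\begin{verbatim}
# A minimal sketch, assuming a cross-attention-style adapter;
# all names and dimensions below are hypothetical.
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """Pools variable-length MLLM hidden states into a fixed set of
    conditional tokens for a frozen text-to-video diffusion model."""
    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 4096,
                 num_query_tokens: int = 77):
        super().__init__()
        # Learned queries compress the MLLM sequence to a fixed length
        # (a common design choice; assumed here, not confirmed by the paper).
        self.queries = nn.Parameter(
            torch.randn(num_query_tokens, mllm_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(
            mllm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim) hidden states from the MLLM.
        q = self.queries.unsqueeze(0).expand(mllm_hidden.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, mllm_hidden, mllm_hidden)
        # Output: (batch, num_query_tokens, cond_dim), fed to the frozen
        # diffusion model in place of (or alongside) text-encoder tokens.
        return self.proj(pooled)

# Usage: only the adapter's parameters are trained; the diffusion
# backbone and the MLLM remain frozen.
adapter = ConditionAdapter()
mllm_hidden = torch.randn(2, 128, 4096)  # stand-in MLLM outputs
cond_tokens = adapter(mllm_hidden)       # conditioning for the T2V model
\end{verbatim}

This kind of adapter keeps the trainable parameter count small relative to the 14B backbone, which is consistent with the parameter-efficient reuse of generative priors described above.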