Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods train a model for one specific task, most often text-to-video (T2V) generation, while many others finetune a pretrained T2V model for image-to-video (I2V), video-to-video (V2V), and image and video manipulation tasks. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or a few tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for all of those tasks. Specifically, we design a lightweight adapter to unify the different conditions used by different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. This joint learning yields a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive 3D space in visual generation. We train two versions of our model with different sizes (8B and 2B parameters), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source code are available at https://github.com/leeruibin/MfM.git.
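
To make the idea of unifying task conditions concrete, the following is a minimal sketch of one common way to express several tasks in a single format: a per-frame mask marking which frames are supplied as conditions and which must be generated. This is an illustration under stated assumptions, not the adapter described above; the function name `condition_mask`, the task labels, and the mask convention are all hypothetical.

```python
from typing import List


def condition_mask(task: str, num_frames: int, num_given: int = 1) -> List[int]:
    """Return a per-frame mask: 1 = frame supplied as a condition, 0 = frame to generate.

    Hypothetical task labels; the actual task list and adapter design may differ.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")

    mask = [0] * num_frames
    if task == "t2v":
        # Text-to-video: no visual condition, every frame is generated.
        pass
    elif task == "i2v":
        # Image-to-video: the first frame is given.
        mask[0] = 1
    elif task == "extension":
        # Video extension: the first num_given frames are given.
        for i in range(min(num_given, num_frames)):
            mask[i] = 1
    elif task == "interpolation":
        # Frame interpolation: the first and last frames are given.
        mask[0] = 1
        mask[-1] = 1
    elif task == "v2v":
        # Video-to-video: every frame carries a condition (e.g. a depth map).
        mask = [1] * num_frames
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


if __name__ == "__main__":
    for name in ("t2v", "i2v", "extension", "interpolation", "v2v"):
        print(name, condition_mask(name, num_frames=5, num_given=2))
```

Under this assumed convention, an image task is simply the single-frame case (`num_frames=1`), which is one way a single input layout could serve both image and video data during joint training.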
