Diffusion models have made tremendous progress in text-driven image and video generation. Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks remain less explored for several reasons. First, training a video generation foundation model requires huge memory and computation overhead, and even with such a foundation model, downstream video synthesis tasks still demand additional costly training. Second, although some works extend image diffusion models to videos in a training-free manner, temporal consistency is not well preserved. Finally, these adaptation methods are specifically designed for one task and fail to generalize to different tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined {\bf BIVDiff}, which bridges specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use a task-specific image diffusion model (e.g., ControlNet or Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into a video diffusion model (e.g., VidRD or ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes, with strong task generalization and high efficiency. To validate the effectiveness and general applicability of BIVDiff, we perform a wide range of video synthesis tasks, including controllable video generation, video editing, video inpainting, and video outpainting.
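The three-step pipeline described above can be summarized in a short sketch. The Python pseudocode below is a minimal illustration under stated assumptions, not the authors' implementation: the callables `image_model`, `video_model`, and `ddim_invert` are hypothetical interfaces standing in for the frame-wise image diffusion model, the video foundation model, and DDIM inversion, and the linear mixing ratio `gamma` in Mixed Inversion is an assumption made here for concreteness.

\begin{verbatim}
import torch

def mixed_inversion(video_latents, ddim_invert, gamma=0.5):
    # Sketch of Mixed Inversion: blend DDIM-inverted latents of the
    # frame-wise generated video with fresh Gaussian noise. The linear
    # mixing with ratio `gamma` is an assumption for illustration.
    z_inv = ddim_invert(video_latents)    # (F, C, H, W) inverted latents
    z_rand = torch.randn_like(z_inv)      # random noise for the video model
    return gamma * z_inv + (1.0 - gamma) * z_rand

def bivdiff(frames, prompt, image_model, video_model,
            ddim_invert, gamma=0.5):
    # Schematic BIVDiff pipeline (hypothetical interfaces):
    # 1) frame-wise generation with an image diffusion model,
    # 2) Mixed Inversion of the generated video,
    # 3) temporal smoothing with a video diffusion model.
    edited = torch.stack([image_model(f, prompt) for f in frames])
    z_mix = mixed_inversion(edited, ddim_invert, gamma)
    return video_model(z_mix, prompt)
\end{verbatim}

Because the image and video models interact only through the latents passed between steps, either component can be swapped (e.g., ControlNet for controllable generation, an inpainting model for video inpainting) without retraining, which is the source of the framework's task generality.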