Diffusion models have made tremendous progress in text-driven image and video generation. Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks remain less explored for several reasons. First, training a video generation foundation model requires huge memory and computation overhead, and even with such a foundation model, downstream video synthesis tasks still demand additional costly training. Second, although some works extend image diffusion models to videos in a training-free manner, temporal consistency is not well preserved. Finally, these adaptation methods are specifically designed for one task and fail to generalize to different tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined {\bf BIVDiff}, which bridges specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use a task-specific image diffusion model (e.g., ControlNet or Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into a video diffusion model (e.g., VidRD or ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes, with strong task generalization and high efficiency. To validate the effectiveness and general applicability of BIVDiff, we perform a wide range of video synthesis tasks, including controllable video generation, video editing, video inpainting, and video outpainting.
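The three-step pipeline described above can be summarized in a short sketch. The Python pseudocode below is a minimal illustration under stated assumptions, not the authors' implementation: the callables `image_model`, `video_model`, and `ddim_invert` are hypothetical interfaces standing in for the frame-wise image diffusion model, the video foundation model, and DDIM inversion, and the linear mixing ratio `gamma` in Mixed Inversion is an assumption made here for concreteness.

\begin{verbatim}
import torch

def mixed_inversion(video_latents, ddim_invert, gamma=0.5):
    # Sketch of Mixed Inversion: blend DDIM-inverted latents of the
    # frame-wise generated video with fresh Gaussian noise. The linear
    # mixing with ratio `gamma` is an assumption for illustration.
    z_inv = ddim_invert(video_latents)    # (F, C, H, W) inverted latents
    z_rand = torch.randn_like(z_inv)      # random noise for the video model
    return gamma * z_inv + (1.0 - gamma) * z_rand

def bivdiff(frames, prompt, image_model, video_model,
            ddim_invert, gamma=0.5):
    # Schematic BIVDiff pipeline (hypothetical interfaces):
    # 1) frame-wise generation with an image diffusion model,
    # 2) Mixed Inversion of the generated video,
    # 3) temporal smoothing with a video diffusion model.
    edited = torch.stack([image_model(f, prompt) for f in frames])
    z_mix = mixed_inversion(edited, ddim_invert, gamma)
    return video_model(z_mix, prompt)
\end{verbatim}

Because the image and video models interact only through the latents passed between steps, either component can be swapped (e.g., ControlNet for controllable generation, an inpainting model for video inpainting) without retraining, which is the source of the framework's task generality.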