This paper presents Video-ControlNet, a controllable text-to-video (T2V) diffusion model that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model, to which it adds a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy lets the model generate videos transferred from the image domain, as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy that introduces a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet achieves resource-efficient convergence and generates high-quality, consistent videos with fine-grained control. Extensive experiments demonstrate its effectiveness in video generation tasks such as video editing and video style transfer, where it outperforms previous methods in consistency and quality. Project Page: https://controlavideo.github.io/
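To make the auto-regressive use of first-frame conditioning concrete, the following is a minimal sketch of how arbitrary-length generation could be organized. It is not the Video-ControlNet implementation: the function `generate_long_video`, the `generate_chunk` callback, the one-frame overlap between consecutive chunks, and the string-valued toy frames are all assumptions introduced for illustration. The sketch only shows the control flow in which the last frame of one chunk conditions the next.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical placeholder types: a real system would use image tensors.
Control = str
Frame = str

# Hypothetical callback signature: given a window of control signals and an
# optional conditioning frame, return one frame per control signal. When a
# conditioning frame is given, output index 0 is assumed to reproduce it.
ChunkFn = Callable[[Sequence[Control], Optional[Frame]], List[Frame]]


def generate_long_video(
    controls: Sequence[Control],
    chunk_len: int,
    generate_chunk: ChunkFn,
) -> List[Frame]:
    """Generate len(controls) frames chunk by chunk (illustrative only)."""
    if chunk_len < 2:
        raise ValueError("chunk_len must be at least 2")
    frames: List[Frame] = []
    while len(frames) < len(controls):
        if not frames:
            # First chunk: no conditioning frame is available yet.
            window = controls[:chunk_len]
            chunk = list(generate_chunk(window, None))
        else:
            # Later chunks: restart at the last generated frame and use it
            # as the conditioning (first) frame of the new chunk.
            start = len(frames) - 1
            window = controls[start:start + chunk_len]
            chunk = list(generate_chunk(window, frames[-1]))
        if len(chunk) != len(window):
            raise ValueError("generate_chunk must return one frame per control")
        if frames:
            # Drop the overlapping conditioning frame to avoid a duplicate.
            chunk = chunk[1:]
        frames.extend(chunk)
    return frames


def toy_generate_chunk(
    window: Sequence[Control], first_frame: Optional[Frame]
) -> List[Frame]:
    """Stand-in for a diffusion sampler: labels each frame by its control."""
    out = [f"frame({c})" for c in window]
    if first_frame is not None:
        out[0] = first_frame  # the conditioning frame is kept as-is
    return out


if __name__ == "__main__":
    controls = [f"c{i}" for i in range(7)]
    video = generate_long_video(controls, chunk_len=3, generate_chunk=toy_generate_chunk)
    print(len(video), video)
```

In this sketch each chunk after the first contributes `chunk_len - 1` new frames, so a shorter chunk length trades memory per sampling call for more calls. Whether the actual model overlaps chunks by exactly one frame is not stated in the abstract.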