Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a \emph{training-free} framework called \textbf{ControlVideo} to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. First, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction to the self-attention modules. Second, to mitigate flicker, it introduces an interleaved-frame smoother that applies frame interpolation to alternate frames. Finally, to produce long videos efficiently, it uses a hierarchical sampler that synthesizes each short clip separately while preserving holistic coherence. Empowered with these modules, ControlVideo outperforms state-of-the-art methods on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to these efficient designs, it generates both short and long videos within several minutes on a single NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
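The three modules can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the released implementation: cross-frame interaction is read here as pooling keys and values from all frames, the smoother uses a plain neighbour average as a stand-in for a real frame-interpolation model, and the clip-splitting rule (consecutive clips sharing a key frame) is one plausible way to organize hierarchical sampling. All function names and parameters are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_cross_frame_attention(q, k, v):
    """Self-attention in which every frame attends to all frames.

    q, k, v have shape (frames, tokens, dim). Keys and values are pooled
    over the frame axis (assumed reading of "full cross-frame interaction").
    """
    f, n, d = q.shape
    k_all = k.reshape(f * n, d)
    v_all = v.reshape(f * n, d)
    scores = q @ k_all.T / np.sqrt(d)        # (frames, tokens, frames * tokens)
    return softmax(scores, axis=-1) @ v_all  # (frames, tokens, dim)


def interleaved_smooth(frames, start):
    """Replace frames start, start + 2, ... with the mean of their neighbours.

    Alternating start between 1 and 2 on successive calls touches odd and
    even interior frames in turn. Averaging is only a placeholder for a
    learned interpolation model.
    """
    frames = np.asarray(frames, dtype=float)
    out = frames.copy()
    for i in range(start, len(frames) - 1, 2):
        out[i] = 0.5 * (frames[i - 1] + frames[i + 1])
    return out


def split_into_clips(num_frames, clip_len):
    """Pick key frames and the short clips spanning consecutive key frames.

    Assumes clip_len >= 2. Key frames would be generated jointly first, then
    each clip generated separately, conditioned on its two key frames.
    """
    keys = list(range(0, num_frames, clip_len - 1))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    clips = [list(range(a, b + 1)) for a, b in zip(keys[:-1], keys[1:])]
    return keys, clips
```

In this sketch, a denoising loop would call `interleaved_smooth` with `start=1` at one timestep and `start=2` at the next, so that every interior frame is smoothed across the two passes.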