In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. We conduct an in-depth exploration of ControlVideo's design to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in terms of faithfulness and consistency while still aligning with the textual prompt. Additionally, it delivers videos with high visual realism and fidelity with respect to the source content, demonstrating flexibility in utilizing controls that contain varying degrees of source video information, as well as the potential for combining multiple controls. The project page is available at \href{https://ml.cs.tsinghua.edu.cn/controlvideo/}{https://ml.cs.tsinghua.edu.cn/controlvideo/}.