In recent years, diffusion models have emerged as the most powerful approach to image synthesis. However, applying these models directly to video synthesis is challenging, as it often produces noticeably flickering content. Although recently proposed zero-shot methods alleviate flicker to some extent, generating coherent videos remains difficult. In this paper, we propose DiffSynth, a novel approach that converts image synthesis pipelines into video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering in the latent space of diffusion models, effectively preventing flicker from accumulating across intermediate steps. The video deflickering algorithm, which we call the patch blending algorithm, remaps objects across frames and blends them together to enhance video consistency. A notable advantage of DiffSynth is its general applicability to a variety of video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoration, and 3D rendering. For text-guided video stylization, DiffSynth makes it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth. All videos can be viewed on our project page. Source code will also be released.
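The idea of deflickering inside the iteration loop, rather than once on the finished video, can be illustrated with a minimal sketch. This is not the DiffSynth implementation: `toy_denoise_step` is a hypothetical stand-in for a per-frame diffusion denoising step, and `temporal_smooth` is a simple neighbour-averaging placeholder used here in place of the patch blending algorithm, whose details the abstract does not give. All function names, shapes, and parameters are assumptions chosen for illustration.

```python
import numpy as np


def temporal_smooth(latents, weight=0.5):
    """Placeholder deflicker (NOT patch blending): blend each frame's latent
    with the average of its two temporal neighbours. Edge frames reuse
    themselves as the missing neighbour."""
    padded = np.concatenate([latents[:1], latents, latents[-1:]], axis=0)
    neighbours = 0.5 * (padded[:-2] + padded[2:])
    return (1.0 - weight) * latents + weight * neighbours


def toy_denoise_step(latents, rng):
    """Hypothetical stand-in for one denoising step applied to each frame
    independently. The per-frame jitter mimics the frame-to-frame
    inconsistency that an image model introduces."""
    jitter = 0.05 * rng.standard_normal(latents.shape)
    return 0.9 * latents + jitter


def synthesize(num_frames=8, latent_shape=(4, 8, 8), num_steps=10,
               deflicker=True, seed=0):
    """Run the toy loop; when deflicker is True, smooth the latents after
    every step so inconsistency cannot accumulate across iterations."""
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal((num_frames, *latent_shape))
    for _ in range(num_steps):
        latents = toy_denoise_step(latents, rng)
        if deflicker:
            latents = temporal_smooth(latents)
    return latents


def flicker(latents):
    """Mean absolute difference between consecutive frames."""
    return float(np.mean(np.abs(np.diff(latents, axis=0))))


if __name__ == "__main__":
    print("flicker without deflickering:", flicker(synthesize(deflicker=False)))
    print("flicker with in-iteration deflickering:", flicker(synthesize(deflicker=True)))
```

In this toy setting, the structural point is only where the deflickering call sits: inside the loop, acting on latents at every step. A real pipeline would replace both placeholder functions with an actual diffusion denoiser and a content-aware deflickering method.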