Large-scale text-to-video diffusion models have demonstrated an exceptional ability to synthesize diverse videos. However, because extensive text-video datasets and the computational resources needed for training are scarce, directly applying these models to video stylization remains difficult. Moreover, because the noise addition process applied to the input content is random and destructive, meeting the content preservation requirement of the style transfer task is challenging. This paper proposes a zero-shot video stylization method named Style-A-Video, which combines a generative pre-trained transformer with an image latent diffusion model to achieve concise text-controlled video stylization. We improve the guidance condition in the denoising process, establishing a balance between artistic expression and structure preservation. Furthermore, to reduce inter-frame flicker and avoid introducing additional artifacts, we employ sampling optimization and a temporal consistency module. Extensive experiments show that our method attains superior content preservation and stylistic performance while incurring lower computational cost than previous solutions. Code will be available at https://github.com/haha-lisa/Style-A-Video.