Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt and produces high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy that leverages spatio-temporal interactions between frames to produce temporally consistent videos faster than existing methods. It is also memory-efficient, which allows it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. To demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities such as dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset, and videos are available at https://rave-video.github.io.
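As a rough illustration of what a noise shuffling strategy could look like, the sketch below randomly regroups per-frame latents at every denoising step, so that each frame is processed alongside a different set of frames each time. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: tiling frames into square grids, the function names `shuffle_into_grids`, `unshuffle_from_grids`, and `edit_video_sketch`, and the `denoise_step` callable, which stands in for one step of a pre-trained text-to-image diffusion model.

```python
import numpy as np


def shuffle_into_grids(latents, grid, rng):
    """Randomly permute frames, then tile them into grid x grid mosaics.

    latents: array of shape (T, C, H, W), one latent per frame.
    Returns the mosaics, shape (T // grid**2, C, grid*H, grid*W), and the
    permutation needed to undo the shuffle.
    """
    t, c, h, w = latents.shape
    per_grid = grid * grid
    # Assumption of this sketch: the frame count fills the grids exactly.
    assert t % per_grid == 0, "frame count must be divisible by grid * grid"
    perm = rng.permutation(t)
    shuffled = latents[perm]
    tiles = shuffled.reshape(t // per_grid, grid, grid, c, h, w)
    # Interleave grid rows with H and grid columns with W.
    mosaics = tiles.transpose(0, 3, 1, 4, 2, 5).reshape(
        t // per_grid, c, grid * h, grid * w
    )
    return mosaics, perm


def unshuffle_from_grids(mosaics, perm, grid):
    """Split mosaics back into frames and restore the original frame order."""
    n, c, gh, gw = mosaics.shape
    h, w = gh // grid, gw // grid
    tiles = mosaics.reshape(n, c, grid, h, grid, w).transpose(0, 2, 4, 1, 3, 5)
    shuffled = tiles.reshape(n * grid * grid, c, h, w)
    latents = np.empty_like(shuffled)
    latents[perm] = shuffled  # invert the permutation
    return latents


def edit_video_sketch(latents, denoise_step, num_steps, grid=2, seed=0):
    """Run a denoising loop that reshuffles frame groupings at every step.

    denoise_step(mosaics, step) is a hypothetical placeholder for one
    denoising step of a text-to-image model applied to each mosaic.
    """
    rng = np.random.default_rng(seed)
    for step in range(num_steps):
        mosaics, perm = shuffle_into_grids(latents, grid, rng)
        mosaics = denoise_step(mosaics, step)
        latents = unshuffle_from_grids(mosaics, perm, grid)
    return latents
```

Because the grouping changes at every step, information can propagate between frames that never share a group in any single step, which is one plausible way such a strategy could encourage temporal consistency without processing the whole video at once.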