The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their application to video. Zero-shot methods seek to extend image diffusion models to videos without any model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However, the soft constraint that attention imposes on where to find valid features can be insufficient, resulting in temporal inconsistency. In this paper, we introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance, our approach explicitly updates the features to achieve high spatial-temporal consistency with the input video, significantly improving the visual coherence of the translated videos. Extensive experiments demonstrate the effectiveness of our framework in producing high-quality, coherent videos, marking a notable improvement over existing zero-shot methods.