Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key challenge is maintaining temporal consistency across the video frames generated by image diffusion models. Previous methods typically adopt cross-frame attention, \emph{i.e.}, sharing the \textit{key} and \textit{value} tokens across the attention operations of different frames, to encourage temporal consistency. However, these methods do not fully resolve temporal inconsistency, which limits the fidelity of the generated videos. In this paper, we find that the bottleneck lies in the unconstrained query tokens and propose a new zero-shot video-to-video translation framework, named \textit{LatentWarp}. Our approach is simple: to make the query tokens temporally consistent, we incorporate a warping operation in the latent space. Specifically, based on the optical flow obtained from the original video, we warp the generated latent features of the previous frame to align with the current frame during the denoising process. As a result, corresponding regions in adjacent frames share closely related query tokens and attention outputs, which improves latent-level consistency and thereby enhances the visual temporal coherence of the generated videos. Extensive experimental results demonstrate the superiority of \textit{LatentWarp} in achieving video-to-video translation with temporal coherence.
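The latent warping step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `warp_latent`, the nested-list latent layout (channels x height x width), the backward-flow convention, nearest-neighbour sampling, and the fallback to the current frame's latent for out-of-bounds locations are all assumptions made for illustration. A practical version would more likely use a tensor library with bilinear sampling.

```python
def warp_latent(prev_latent, flow, cur_latent):
    """Backward-warp the previous frame's latent onto the current frame.

    prev_latent, cur_latent: nested lists shaped [channels][height][width].
    flow[y][x] = (dx, dy): assumed displacement, at latent resolution, from
    location (x, y) in the current frame to its match in the previous frame.
    """
    channels = len(prev_latent)
    height = len(prev_latent[0])
    width = len(prev_latent[0][0])
    warped = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbour lookup of the matching previous-frame location.
            src_x = round(x + dx)
            src_y = round(y + dy)
            inside = 0 <= src_x < width and 0 <= src_y < height
            for c in range(channels):
                if inside:
                    warped[c][y][x] = prev_latent[c][src_y][src_x]
                else:
                    # Assumption: keep the current latent where no match exists.
                    warped[c][y][x] = cur_latent[c][y][x]
    return warped


# Toy example: content shifts one latent cell to the right between frames,
# so each current location (x, y) matches (x - 1, y) in the previous frame.
prev = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
cur = [[[9.0, 9.0, 9.0], [9.0, 9.0, 9.0]]]
flow = [[(-1.0, 0.0)] * 3 for _ in range(2)]
print(warp_latent(prev, flow, cur))
# prints [[[9.0, 1.0, 2.0], [9.0, 4.0, 5.0]]]
```

In the toy example, the leftmost column has no counterpart in the previous frame, so it keeps the current frame's values, while the remaining columns reuse the previous frame's latent. Regions that correspond under the flow therefore start from the same latent values, which is the property the abstract relies on to obtain closely related query tokens.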