The goal of conditional image-to-video (cI2V) generation is to create a believable new video starting from a given condition, i.e., one image and text. Previous cI2V generation methods conventionally operate in RGB pixel space, which limits their ability to model motion consistency and visual continuity. Generating videos in pixel space is also inefficient. In this paper, we propose a novel approach that addresses these challenges by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions. Specifically, we use a 3D-UNet diffusion model to predict the temporal motions, which comprise motion vectors and residuals. By explicitly modeling temporal motions and warping them onto the starting image, we improve the temporal consistency of the generated videos. This reduces spatial redundancy and emphasizes temporal details. Our method achieves these performance improvements by disentangling content and motion, without introducing new structural complexity to the model. Extensive experiments on various datasets confirm that our approach outperforms the majority of state-of-the-art methods in both effectiveness and efficiency.
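To make the decomposition concrete, the following is a minimal sketch of how a frame could be rebuilt from the starting image, a motion-vector field, and a residual. It is an illustration under stated assumptions, not the method's actual implementation: the motion vectors and residual are taken as given (in the proposed approach they would be predicted by the 3D-UNet diffusion model), the warp is a backward warp with nearest-neighbor sampling and border clipping, and the name `warp_with_motion` and the `(dy, dx)` layout are hypothetical choices made for this example.

```python
import numpy as np


def warp_with_motion(start_frame, motion_vectors, residual):
    """Backward-warp start_frame by per-pixel motion vectors, then add a residual.

    Assumed (hypothetical) conventions for this sketch:
      start_frame:    (H, W, C) array, the conditioning image.
      motion_vectors: (H, W, 2) array; [..., 0] is dy and [..., 1] is dx,
                      pointing from each target pixel to its source pixel.
      residual:       (H, W, C) array, a correction added after warping.
    """
    h, w = start_frame.shape[:2]
    # Pixel coordinate grids for the target frame.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Round displacements to the nearest pixel and clip to the image border.
    src_y = np.clip(ys + np.rint(motion_vectors[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(xs + np.rint(motion_vectors[..., 1]).astype(int), 0, w - 1)
    # Gather source pixels, then apply the residual correction.
    return start_frame[src_y, src_x] + residual


# Toy usage: every target pixel samples the source pixel one column to its right.
start = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
mv = np.zeros((4, 4, 2), dtype=np.float32)
mv[..., 1] = 1.0
frame = warp_with_motion(start, mv, np.zeros_like(start))
```

In this form only the motion field and the residual vary from frame to frame, while the spatial content comes from the single starting image, which is the source of the reduced spatial redundancy described above.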