We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video (I2V) generation. In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. Guided by the trajectories predicted in the first stage, this module effectively propagates the reference image's features to the synthesized frames. Compared with existing methods, Motion-I2V generates more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more control over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation.
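To make the idea of propagating reference-image features along predicted trajectories concrete, the following is a minimal sketch and not the Motion-I2V module itself. It assumes a toy setting: a single-channel feature grid, integer per-pixel displacements, and nearest-pixel forward warping. The function name `propagate_features` and its arguments are hypothetical and chosen only for illustration.

```python
def propagate_features(ref_features, displacement):
    """Forward-warp a reference feature grid along per-pixel displacements.

    Illustrative assumption only: the real model operates on latent
    features inside temporal attention, not on a plain 2-D grid.

    ref_features: H x W list of lists holding one feature value per pixel.
    displacement: H x W list of lists of (dy, dx) integer offsets that say
                  where each reference pixel moves in the target frame.
    Returns an H x W grid; cells that no reference pixel lands on are None.
    """
    height = len(ref_features)
    width = len(ref_features[0])
    warped = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = displacement[y][x]
            ty, tx = y + dy, x + dx
            # Pixels that move outside the frame are dropped.
            if 0 <= ty < height and 0 <= tx < width:
                # If two pixels land on the same cell, the later one wins.
                warped[ty][tx] = ref_features[y][x]
    return warped


# Usage: every pixel moves one step to the right.
ref = [[1, 2],
       [3, 4]]
shift_right = [[(0, 1), (0, 1)],
               [(0, 1), (0, 1)]]
warped = propagate_features(ref, shift_right)
```

In this toy example the left column of `warped` stays empty because no reference pixel moves into it. Such uncovered cells are regions that a generative second stage would have to synthesize rather than copy from the reference image.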