Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories, and fail to capture highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method that uses motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting motion field information from videos and learning motion trajectories and strengths. Current pretrained video generation models are typically limited to very short clips of fewer than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise rescheduling, tailored specifically to image animation, which enables the creation of videos over 100 frames long while maintaining scene consistency and motion coordination. Specifically, we decompose the denoising process into two distinct phases: shaping scene contours and refining motion details. We then reschedule the noise so that the generated frame sequences maintain long-range noise correlation. We conducted extensive experiments against 10 baselines, encompassing both commercial tools and academic methods, which demonstrate the superiority of our method. Our project page: https://wangqiang9.github.io/Controllable.github.io/
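To make the noise-rescheduling idea concrete, the sketch below is our own minimal illustration, not the paper's released code: each frame's initial noise is a mixture of a single shared base noise (which keeps distant frames correlated) and independent per-frame noise (which leaves room for motion detail). The function name `rescheduled_noise` and the mixing weight `alpha` are assumptions introduced for illustration only.

```python
import torch


def rescheduled_noise(num_frames: int, shape: tuple, alpha: float = 0.7,
                      generator: torch.Generator | None = None) -> torch.Tensor:
    """Sample per-frame initial noise with long-range correlation.

    A shared base noise anchors every frame; independent per-frame noise
    is blended in with weight (1 - alpha). The mixture is rescaled so each
    frame stays unit-variance Gaussian, as DDPM-style samplers expect.
    This is a hypothetical sketch of the rescheduling idea, not the
    authors' implementation.
    """
    base = torch.randn(shape, generator=generator)  # shared across all frames
    frames = []
    for _ in range(num_frames):
        residual = torch.randn(shape, generator=generator)  # frame-specific part
        mixed = alpha * base + (1.0 - alpha) * residual
        # Var(mixed) = alpha^2 + (1 - alpha)^2; rescale back to unit variance.
        mixed = mixed / (alpha ** 2 + (1.0 - alpha) ** 2) ** 0.5
        frames.append(mixed)
    return torch.stack(frames)  # shape: (num_frames, *shape)


# Example: initial latents for a 100-frame sequence of 64x64 latents.
latents = rescheduled_noise(num_frames=100, shape=(4, 64, 64))
```

Under this reading, the two-phase decomposition corresponds to denoising the shared, highly correlated component first (shaping scene contours at high noise levels) and the per-frame component later (refining motion details); the exact phase boundary and sampler details are not specified in the abstract.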