Diffusion models have made great progress in image animation owing to their powerful generative capabilities. However, maintaining spatio-temporal consistency with the detailed information of the input static image over time (e.g., its style, background, and objects) and ensuring smoothness in animated video narratives guided by textual prompts remain challenging. In this paper, we introduce Cinemo, a novel image animation approach that achieves better motion controllability, as well as stronger temporal consistency and smoothness. To accomplish this, we propose three effective strategies applied at the training and inference stages of Cinemo. At the training stage, Cinemo learns the distribution of motion residuals via a motion diffusion model, rather than directly predicting subsequent frames. In addition, a structural similarity index (SSIM)-based strategy is proposed to give Cinemo finer control over motion intensity. At the inference stage, a noise refinement technique based on the discrete cosine transform is introduced to mitigate sudden motion changes. Together, these three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics demonstrate the effectiveness and superiority of our proposed approach.
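The abstract does not specify how motion intensity is quantified for the SSIM-based control strategy. As a hedged illustration only (not the paper's actual formulation), one plausible signal is the average structural dissimilarity between consecutive frames of a training clip, computed here with a simplified global (non-windowed) SSIM; the function names `ssim_global` and `motion_intensity` are hypothetical:

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Simplified global (non-windowed) SSIM between two grayscale images.

    Uses the standard SSIM stabilization constants C1 = (0.01*L)^2 and
    C2 = (0.03*L)^2, where L is the dynamic range of the pixel values.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def motion_intensity(frames):
    """Average (1 - SSIM) over consecutive frame pairs.

    A static clip scores ~0; larger values indicate stronger motion,
    so the score could serve as a conditioning signal for intensity control.
    """
    scores = [1.0 - ssim_global(a, b) for a, b in zip(frames[:-1], frames[1:])]
    return float(np.mean(scores))
```

Under this sketch, a user-supplied target intensity would be matched against scores of this kind during training, giving a scalar knob for motion strength at inference.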
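The abstract likewise leaves the DCT-based noise refinement unspecified. A minimal sketch, assuming one possible reading: the low-frequency DCT coefficients of the initial sampling noise are replaced with those of the input image, anchoring coarse structure while leaving high-frequency noise free. The helper names `dct_matrix` and `refine_noise`, and the `keep` cutoff, are illustrative assumptions, not the paper's method:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (n x n), so inverse = transpose."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses the sqrt(1/n) normalization
    return c

def refine_noise(noise, image, keep=8):
    """Swap the lowest `keep` x `keep` 2-D DCT coefficients of the noise
    for those of the input image; assumes square single-channel arrays."""
    n = noise.shape[0]
    c = dct_matrix(n)
    noise_dct = c @ noise @ c.T   # forward 2-D DCT
    image_dct = c @ image @ c.T
    noise_dct[:keep, :keep] = image_dct[:keep, :keep]
    return c.T @ noise_dct @ c    # inverse 2-D DCT (orthonormal basis)
```

Because the basis is orthonormal, the round trip is exact, so only the chosen low-frequency block of the noise is altered; in principle this damps abrupt frame-to-frame changes at the start of sampling.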