We present MOFA-Video, an advanced controllable image animation method that generates videos from a given image using various additional control signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations. This differs from previous methods, which can only work in a specific motion domain or show weak control ability with the diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (\ie, MOFA-Adapters) to control the generated motions in the video generation pipeline. In a MOFA-Adapter, we account for the temporal motion consistency of the video: a dense motion flow is first generated from the given sparse control conditions, and the multi-scale features of the given image are then warped as guidance features for stable video diffusion generation. We train two motion adapters separately, one for manual trajectories and one for human landmarks, since both conditions carry only sparse control information. After training, MOFA-Adapters from different domains can also work together for more controllable video generation. Project Page: https://myniuuu.github.io/MOFA_Video/
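To make the pipeline concrete, below is a minimal PyTorch sketch of the core MOFA-Adapter idea described above: sparse control hints are densified into a flow field, which then warps multi-scale image features into guidance for a video diffusion backbone. This is an illustrative toy, not the authors' implementation; `ToyMOFAAdapter`, the tiny sparse-to-dense CNN, and all tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a feature map `feat` (b, c, h, w) by a dense
    pixel-space flow field `flow` (b, 2, h, w)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1)            # (h, w, 2), x first
    coords = base + flow.permute(0, 2, 3, 1)        # (b, h, w, 2)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[..., 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[..., 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1),
                         align_corners=True)


class ToyMOFAAdapter(nn.Module):
    """Sparse motion hints -> dense flow -> warped multi-scale guidance."""

    def __init__(self):
        super().__init__()
        # Stand-in for the paper's sparse-to-dense motion generator: a tiny
        # CNN mapping (dx, dy, mask) hints to a dense 2-channel flow field.
        self.sparse_to_dense = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, image_feats, sparse_hints):
        # sparse_hints: (b, 3, H, W) with (dx, dy, validity) at a few pixels.
        flow = self.sparse_to_dense(sparse_hints)   # dense flow, (b, 2, H, W)
        guided = []
        for feat in image_feats:                    # multi-scale image features
            scale = feat.shape[-1] / flow.shape[-1]
            f = F.interpolate(flow, size=feat.shape[-2:],
                              mode="bilinear", align_corners=True) * scale
            guided.append(warp(feat, f))            # guidance for the denoiser
        return guided


# Usage with dummy tensors standing in for encoder features and a user hint:
feats = [torch.randn(1, c, 64 >> i, 64 >> i) for i, c in enumerate((64, 128, 256))]
hints = torch.zeros(1, 3, 64, 64)                   # e.g., a drawn trajectory
guided = ToyMOFAAdapter()(feats, hints)
print([g.shape for g in guided])
```

Note the per-scale rescaling of the flow: a displacement of k pixels at full resolution corresponds to k/2 pixels at half resolution, so the interpolated flow is multiplied by the resolution ratio before warping each feature level.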