Recent breakthroughs in video diffusion transformers have shown remarkable capabilities in diverse motion generation. For the motion-transfer task, current methods mainly rely on two-stage Low-Rank Adaptation (LoRA) fine-tuning to improve performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between the generated and input videos because of the inherent spatial-temporal coupling in the 3D attention operator. In addition, these methods require time-consuming fine-tuning in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that fine-tunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA that decouples the attention architecture into spatial appearance processing and temporal motion processing. In the second training stage, we design sparse motion sampling and adaptive RoPE to accelerate tuning. To address the lack of a benchmark in this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion types, including creative camera motion, single-object motion, multiple-object motion, and complex human motion. Extensive evaluations on MotionBench verify the superiority of Follow-Your-Motion.
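For readers unfamiliar with the LoRA update that the framework builds on, the following is a minimal sketch of a generic low-rank adapter on a single frozen linear layer, written in dependency-free Python. It illustrates only the standard formulation W' = W + (alpha / r) · B · A. The abstract does not specify how the adapters are split between spatial and temporal attention, so that part is not modeled here. The names `LoRALinear`, `matmul`, and `effective_weight` are hypothetical and chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


class LoRALinear:
    """Generic LoRA adapter on a frozen weight matrix (illustrative sketch only).

    This is NOT the paper's spatial-temporal decoupled LoRA; it shows the
    standard low-rank update that such a method would build on.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        self.weight = weight  # frozen base weight, shape (out_dim, in_dim)
        out_dim, in_dim = len(weight), len(weight[0])
        rng = random.Random(seed)
        # A: (rank, in_dim), small random init; B: (out_dim, rank), zero init,
        # so the adapter contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to one input vector of length in_dim."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2.0)
print(layer.forward([1.0, 1.0]))  # equals the frozen layer's output while B is zero
```

Only `A` and `B` would be trained, so the number of tuned parameters scales with the rank rather than with the full weight matrix. This is why LoRA is attractive for large video diffusion transformers, and also why two full tuning stages can still be costly.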