Recent breakthroughs in video diffusion transformers have shown remarkable capabilities in generating diverse motion. For the motion-transfer task, current methods mainly rely on two-stage Low-Rank Adaptation (LoRA) fine-tuning to improve performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between the generated and input videos because of the inherent spatial-temporal coupling in the 3D attention operator. It also requires time-consuming fine-tuning in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that fine-tunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA that decouples the attention architecture into spatial appearance processing and temporal motion processing. During the second training stage, we design sparse motion sampling and adaptive RoPE to accelerate tuning. To address the lack of a benchmark in this field, we introduce MotionBench, a comprehensive benchmark covering diverse motion types, including creative camera motion, single-object motion, multiple-object motion, and complex human motion. Extensive evaluations on MotionBench verify the superiority of Follow-Your-Motion.
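For readers unfamiliar with the LoRA fine-tuning that the abstract builds on, the following is a minimal sketch of a generic low-rank adapter on a single linear layer. It is not the spatial-temporal decoupled LoRA proposed here, whose details the abstract does not give; the class name `LoRALinear` and the parameters `rank`, `alpha`, and `seed` are illustrative assumptions, and NumPy stands in for a deep learning framework.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a low-rank update (generic LoRA, illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen base weight, shape (out_dim, in_dim)
        # Down-projection A starts small and random; up-projection B starts at zero,
        # so the adapted layer initially reproduces the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, in_dim); the result has shape (batch, out_dim).
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size


# Hypothetical usage: a 64x64 layer with a rank-4 adapter.
layer = LoRALinear(np.eye(64), rank=4)
tokens = np.ones((2, 64))
adapted = layer(tokens)
```

In this example only 512 adapter parameters would be trained, against 4096 in the frozen weight, which is why adapter-based tuning is attractive for large video diffusion transformers.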