Motion-based controllable video generation offers the potential for creating captivating visual content. Existing methods typically necessitate model training to encode particular motion cues or incorporate fine-tuning to inject certain motion patterns, resulting in limited flexibility and generalization. In this work, we propose MotionClone, a training-free framework that clones motion from a reference video to guide versatile motion-controlled video generation, including text-to-video and image-to-video. Based on the observation that the dominant components in temporal-attention maps drive motion synthesis, while the remaining components mainly capture noisy or very subtle motions, MotionClone utilizes sparse temporal attention weights as motion representations for motion guidance, facilitating diverse motion transfer across varying scenarios. Meanwhile, MotionClone allows the motion representation to be extracted directly in a single denoising step, bypassing cumbersome inversion processes and thus promoting both efficiency and flexibility. Extensive experiments demonstrate that MotionClone excels at both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
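To make the core idea concrete, the minimal sketch below shows one way to sparsify a temporal-attention map into a motion representation and compare it against a reference. The tensor layout, the top-k sparsification, and the names `sparse_motion_representation` and `motion_guidance_loss` are illustrative assumptions, not the MotionClone implementation.

```python
import torch


def sparse_motion_representation(temporal_attn: torch.Tensor, k: int = 1):
    """Keep only the top-k dominant entries of a temporal-attention map.

    temporal_attn: attention weights of shape (heads, seq, frames, frames),
    where the last two dims attend across video frames. This layout is a
    hypothetical choice for illustration.
    """
    # Keep the k largest weights along the key-frame axis; the remaining
    # entries are treated as noisy / sub-threshold motion and zeroed out.
    topk = temporal_attn.topk(k, dim=-1)
    mask = torch.zeros_like(temporal_attn, dtype=torch.bool)
    mask.scatter_(-1, topk.indices, True)
    return temporal_attn * mask, mask


def motion_guidance_loss(ref_attn: torch.Tensor,
                         gen_attn: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """L2 distance between reference and generated attention, restricted to
    the sparse mask, used as a guidance energy during denoising."""
    return ((ref_attn - gen_attn)[mask] ** 2).mean()
```

In a guidance-based sampler, the gradient of such a loss with respect to the current latent would steer each denoising step toward the reference motion while leaving appearance to the text or image condition; the exact guidance mechanism here is assumed, not quoted from the paper.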