Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, showcasing a notable advancement in generative AI. However, these models generally lack fine-grained control over motion patterns, limiting their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfer across various contexts. Our approach requires no training and operates at test time by leveraging the inherent capabilities of pre-trained video diffusion models. Unlike traditional approaches, which struggle to maintain consistent motion through comprehensive scene changes, MotionFlow successfully handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility, even during drastic scene alterations.
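To make the attention-based mechanism concrete, below is a minimal, self-contained PyTorch sketch of the general idea of caching cross-attention maps during a source pass and injecting them during target generation. All module and variable names here are hypothetical illustrations, not MotionFlow's actual implementation; a real system would perform this inside the denoising loop of a pretrained video diffusion model rather than on toy tensors.

```python
import torch

class CrossAttention(torch.nn.Module):
    """Toy cross-attention layer. `mode` switches between caching the
    attention map (source pass) and injecting a cached map (target pass)."""
    def __init__(self, dim, text_dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(text_dim, dim, bias=False)
        self.to_v = torch.nn.Linear(text_dim, dim, bias=False)
        self.scale = dim ** -0.5
        self.cached_attn = None   # attention map saved from the source pass
        self.mode = "none"        # "cache" | "inject" | "none"

    def forward(self, x, text_emb):
        q = self.to_q(x)                      # (B, N_latent, dim)
        k = self.to_k(text_emb)               # (B, N_text, dim)
        v = self.to_v(text_emb)               # (B, N_text, dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)           # (B, N_latent, N_text)
        if self.mode == "cache":
            self.cached_attn = attn.detach()  # record spatio-temporal layout
        elif self.mode == "inject" and self.cached_attn is not None:
            attn = self.cached_attn           # reuse the source motion's map
        return attn @ v

# Usage: run the source prompt once to cache the maps, then generate the
# target prompt with those maps injected, transferring the motion layout.
layer = CrossAttention(dim=64, text_dim=32)
latents = torch.randn(1, 16 * 8 * 8, 64)      # 16 frames of 8x8 latents, flattened
src_text = torch.randn(1, 7, 32)              # stand-in source prompt embedding
tgt_text = torch.randn(1, 7, 32)              # stand-in target prompt embedding

layer.mode = "cache"
_ = layer(latents, src_text)                  # source pass: record attention
layer.mode = "inject"
out = layer(torch.randn_like(latents), tgt_text)  # target pass: reuse attention
print(out.shape)                              # torch.Size([1, 1024, 64])
```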