We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT, analyzing its cross-frame attention maps to extract a patch-wise motion signal that we call the Attention Motion Flow (AMF). We then guide the latent denoising process in an optimization-based, training-free manner, optimizing the latents with our AMF loss so that the generated video reproduces the motion of the reference. We also apply the same optimization strategy to the transformer positional embeddings, which further improves zero-shot motion transfer. We evaluate DiTFlow against recently published methods and outperform all of them across multiple metrics and in human evaluation.
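To make the two core steps concrete, extracting the AMF from cross-frame attention and steering denoising with the AMF loss, the sketch below gives a minimal PyTorch illustration. It assumes the attention maps are already row-normalized probabilities; the soft-argmax displacement, the L2 form of the loss, and the names `attention_motion_flow`, `amf_loss`, `guide_latents`, and `dit_step` are illustrative choices not specified by the abstract.

```python
import torch
import torch.nn.functional as F

def attention_motion_flow(attn, grid_h, grid_w):
    """Patch-wise motion field from cross-frame attention.

    attn: [F-1, N, N] row-normalized attention from the patches of each
          frame f > 0 (queries) to the patches of a reference frame (keys),
          with N = grid_h * grid_w tokens per frame.
    Returns a [F-1, N, 2] displacement field: for every query patch, the
    expected (dy, dx) offset to its matching patch in the key frame.
    """
    device = attn.device
    ys = torch.arange(grid_h, device=device, dtype=torch.float32)
    xs = torch.arange(grid_w, device=device, dtype=torch.float32)
    # (y, x) coordinates of every patch on the token grid, shape [N, 2]
    coords = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(-1, 2)
    # soft-argmax: attention-weighted average of key positions (differentiable)
    matched = attn @ coords             # [F-1, N, 2]
    return matched - coords             # displacement per query patch

def amf_loss(ref_amf, gen_amf):
    # one simple choice of distance between reference and generated motion
    return F.mse_loss(gen_amf, ref_amf)

def guide_latents(latents, dit_step, ref_amf, grid_h, grid_w, n_iters=5, lr=0.1):
    """Training-free guidance: refine the current denoising latents by
    gradient descent on the AMF loss. `dit_step` is a hypothetical hook
    returning the DiT's cross-frame attention maps for the given latents."""
    latents = latents.detach().requires_grad_(True)
    opt = torch.optim.Adam([latents], lr=lr)
    for _ in range(n_iters):
        attn = dit_step(latents)        # [F-1, N, N] attention maps
        loss = amf_loss(ref_amf, attention_motion_flow(attn, grid_h, grid_w))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latents.detach()
```

The abstract states that the same optimization is also applied to the positional embeddings; the loop above shows only the latent variant for brevity.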