We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free manner by optimizing latents with our AMF loss to generate videos that reproduce the motion of the reference. We also apply our optimization strategy to transformer positional embeddings, further boosting zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all of them across multiple metrics and in human evaluation.
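To make the core idea concrete, here is a minimal sketch of how a patch-wise motion signal could be derived from a cross-frame attention map and compared with an MSE-style loss. This is a hypothetical simplification for illustration only, not the authors' implementation: the function names, the soft-argmax-style expected-position readout, and the plain MSE loss are all assumptions about one plausible instantiation of "Attention Motion Flow".

```python
import numpy as np

def attention_motion_flow(attn, grid_w):
    """Hypothetical AMF sketch (not the paper's exact formulation).

    attn:   (P, P) row-stochastic attention map from frame t patches
            (queries) to frame t+1 patches (keys).
    grid_w: width of the patch grid.
    Returns a (P, 2) array of per-patch displacements (dx, dy) in
    patch units: expected attended position minus source position.
    """
    P = attn.shape[0]
    # (x, y) grid coordinates of each patch index
    src = np.stack([np.arange(P) % grid_w, np.arange(P) // grid_w], axis=1).astype(float)
    # expected target position under the attention distribution
    tgt = attn @ src
    return tgt - src

def amf_loss(flow_ref, flow_gen):
    """Assumed loss: mean squared error between reference and generated flows."""
    return float(np.mean((flow_ref - flow_gen) ** 2))
```

In this reading, guidance would backpropagate `amf_loss` through the DiT attention maps to the latents at each denoising step; identity attention (every patch attending to itself) yields zero flow, i.e. a static scene.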