Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and incur computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach to motion-conditioned video generation: we inject low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Across several applications, we demonstrate effective control over both appearance and dynamics in generated videos, achieving results competitive with or superior to those of more complex conditioning approaches.
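
To make the idea of phase injection concrete, the following is a minimal NumPy sketch and not the reference implementation. The abstract does not specify the transform, the cutoff, or how the reference latents are obtained, so everything here is an assumption for illustration: a 3D FFT over the time, height, and width axes, a spherical low-pass mask with a hypothetical `cutoff` parameter (in cycles per sample), and a reference latent `ref_latent` that is assumed to have already been encoded into the same latent space and shape `(T, C, H, W)` as the noise. The function names `low_freq_mask` and `inject_low_freq_phase` are hypothetical.

```python
import numpy as np


def low_freq_mask(shape, cutoff):
    """Boolean mask over (T, H, W) frequency bins whose radius is <= cutoff.

    Frequencies are in cycles per sample, so the Nyquist limit is 0.5.
    The spherical mask shape is an assumption made for this sketch.
    """
    t, h, w = shape
    ft = np.fft.fftfreq(t)[:, None, None]
    fh = np.fft.fftfreq(h)[None, :, None]
    fw = np.fft.fftfreq(w)[None, None, :]
    return np.sqrt(ft**2 + fh**2 + fw**2) <= cutoff


def inject_low_freq_phase(noise, ref_latent, cutoff=0.15):
    """Replace the low-frequency phase of `noise` with that of `ref_latent`.

    Both inputs are real arrays of shape (T, C, H, W). The magnitude
    spectrum of the noise is kept unchanged; only the phase inside the
    low-pass mask is taken from the reference.
    """
    axes = (0, 2, 3)  # transform over time and space, not over channels
    noise_f = np.fft.fftn(noise, axes=axes)
    ref_f = np.fft.fftn(ref_latent, axes=axes)

    t, _, h, w = noise.shape
    mask = low_freq_mask((t, h, w), cutoff)[:, None, :, :]  # broadcast over C

    # Reference phase inside the mask, original noise phase elsewhere.
    phase = np.where(mask, np.angle(ref_f), np.angle(noise_f))
    mixed = np.abs(noise_f) * np.exp(1j * phase)

    # The mask is symmetric under frequency negation, so the mixed spectrum
    # stays conjugate-symmetric and the inverse transform is real up to
    # floating-point error.
    return np.fft.ifftn(mixed, axes=axes).real
```

Under these assumptions, the returned array would be passed to the sampler in place of the initial Gaussian noise, leaving the model and inference loop untouched. Keeping the noise magnitudes is a design choice of this sketch: it preserves the overall energy of the latent, though the result is no longer an independent Gaussian sample in the low-frequency band.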