The evolution of diffusion models has greatly impacted video generation and understanding. In particular, text-to-video diffusion models (VDMs) have made it much easier to customize an input video with a target appearance, motion, and other attributes. Despite these advances, accurately distilling motion information from video frames remains challenging. Existing works use the residual between consecutive frames as the target motion vector, but this residual inherently lacks global motion context and is vulnerable to frame-wise distortions. To address this, we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, which facilitates the learning of whole-frame global motion dynamics and mitigates spatial artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
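To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch of (a) a consecutive-frame residual used as a motion vector and (b) a frequency-domain penalty that compares two residuals through their 2D Fourier amplitude spectra. It is an illustration under stated assumptions, not the SMA objective itself: the function names `frame_residuals` and `spectral_alignment_loss`, the choice of an amplitude-only mean squared error, and the toy random video are all hypothetical, and the wavelet component of SMA is not shown.

```python
import numpy as np


def frame_residuals(video: np.ndarray) -> np.ndarray:
    """Return consecutive-frame residuals video[t + 1] - video[t].

    `video` is assumed to have shape (frames, height, width).
    """
    return video[1:] - video[:-1]


def spectral_alignment_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Hypothetical frequency-domain penalty (an assumption, not the paper's loss).

    Compares the 2D Fourier amplitude spectra of each residual frame
    with a mean squared error over all frames and frequencies.
    """
    pred_amp = np.abs(np.fft.fft2(pred, axes=(-2, -1)))
    target_amp = np.abs(np.fft.fft2(target, axes=(-2, -1)))
    return float(np.mean((pred_amp - target_amp) ** 2))


# Toy data standing in for a short grayscale clip.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16, 16))

target_motion = frame_residuals(video)  # shape (7, 16, 16)
noisy_motion = target_motion + 0.1 * rng.standard_normal(target_motion.shape)

loss_same = spectral_alignment_loss(target_motion, target_motion)
loss_noisy = spectral_alignment_loss(noisy_motion, target_motion)
```

Because each Fourier coefficient depends on every pixel of a frame, a penalty of this kind responds to whole-frame structure rather than to isolated pixel differences, which is the intuition behind regularizing motion in the frequency domain.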