Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs when one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we restrict the corresponding attention layers to a local neighborhood, so that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, our method, FastVMT, achieves a 3.43× speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
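The local attention restriction can be illustrated as follows. This is a minimal sketch, not the paper's implementation: it builds a boolean mask over flattened spatial tokens that keeps only pairs within a Chebyshev-distance neighborhood, which can then suppress attention scores for distant regions before the softmax. The function name and the choice of Chebyshev distance are illustrative assumptions.

```python
import torch

def local_attention_mask(h: int, w: int, radius: int) -> torch.Tensor:
    # Boolean (h*w, h*w) mask: True where two tokens' spatial positions
    # lie within `radius` of each other (Chebyshev / chessboard distance).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (h*w, 2)
    dist = (pos[:, None, :] - pos[None, :, :]).abs().amax(dim=-1)
    return dist <= radius

mask = local_attention_mask(4, 4, radius=1)
# Scores where `mask` is False can be set to -inf before the softmax,
# so interaction weights for distant image regions carry no weight.
```

In practice such a mask would be passed to the attention computation (e.g. as an additive bias of `-inf` at masked positions), so only local token pairs contribute.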
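The gradient-reuse idea can be sketched as a generic optimization loop. This is an illustrative assumption, not the paper's exact scheme: the gradient is recomputed only every few steps and the cached one is reused in between, exploiting the observation that gradients change slowly along the trajectory. The function name, the fixed reuse interval, and the quadratic example are all hypothetical.

```python
import torch

def optimize_with_gradient_reuse(x, loss_fn, steps, reuse_every=2, lr=0.1):
    # Recompute the gradient only every `reuse_every` steps; otherwise
    # reuse the cached gradient, skipping the backward pass entirely.
    cached_grad = None
    for t in range(steps):
        if cached_grad is None or t % reuse_every == 0:
            x = x.detach().requires_grad_(True)
            loss = loss_fn(x)
            cached_grad = torch.autograd.grad(loss, x)[0]
        x = (x - lr * cached_grad).detach()
    return x
```

With `reuse_every=2`, half of the backward passes are skipped; the update direction is slightly stale on reuse steps but remains accurate when successive gradients differ little.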