Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but they fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work. Motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood so that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, our method, FastVMT, achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
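The two ideas can be illustrated with a minimal sketch. It is not the FastVMT implementation: the square window, the `radius` parameter, the fixed `refresh_every` schedule, the toy quadratic loss, and all function names are assumptions made purely for illustration. The first function builds a mask that keeps only query-key pairs within a local neighborhood of the token grid; the second runs gradient descent while recomputing the gradient only on some steps and reusing the cached one otherwise.

```python
def local_attention_mask(height, width, radius):
    """Boolean mask over (query token, key token) pairs on an H x W token grid.

    mask[q][k] is True when key k lies within `radius` grid cells
    (Chebyshev distance) of query q, i.e. inside a square window.
    The window shape and radius are illustrative assumptions.
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [
        [max(abs(qy - ky), abs(qx - kx)) <= radius for (ky, kx) in coords]
        for (qy, qx) in coords
    ]


def optimize_with_gradient_reuse(x0, grad_fn, num_steps, step_size, refresh_every):
    """Gradient descent that recomputes the gradient only every `refresh_every` steps.

    On the remaining steps the cached gradient is reused, which is only
    reasonable when the gradient changes slowly between steps.
    Returns the final value and the number of gradient evaluations performed.
    """
    x = x0
    cached_grad = None
    evaluations = 0
    for step in range(num_steps):
        if cached_grad is None or step % refresh_every == 0:
            cached_grad = grad_fn(x)  # expensive call in a real pipeline
            evaluations += 1
        x = x - step_size * cached_grad
    return x, evaluations


if __name__ == "__main__":
    mask = local_attention_mask(height=4, width=4, radius=1)
    kept = sum(sum(row) for row in mask)
    print(f"attention pairs kept: {kept} of {len(mask) ** 2}")

    # Toy quadratic loss 0.5 * (x - 3)^2 standing in for a motion loss (assumption).
    x, evals = optimize_with_gradient_reuse(
        x0=0.0, grad_fn=lambda v: v - 3.0, num_steps=20, step_size=0.1, refresh_every=2
    )
    print(f"x = {x:.3f} after {evals} gradient evaluations")
```

In this toy setting the mask removes most query-key pairs even on a small grid, and the reuse schedule halves the number of gradient evaluations. How the actual method chooses the neighborhood and decides which gradient computations to skip is not specified in the abstract.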